---
title: "Travel Times"
author: "Mark Padgham & Alexandra Kapp"
date: "`r Sys.Date()`"
output: 
  html_document:
    toc: true
    toc_float: true
    number_sections: false
    theme: flatly
vignette: >
  %\VignetteIndexEntry{Travel Times}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
  
```{r DTthread, echo = FALSE}
# Necessary for CRAN to avoid CPU / elapsed time ratios being too high
nthr <- data.table::setDTthreads (1)
```

`gtfsrouter` includes a function,
[`gtfs_traveltimes()`](https://UrbanAnalyst.github.io/gtfsrouter/reference/gtfs_traveltimes.html),
to efficiently calculate travel times from a nominated station to all other
stations within a GTFS feed. The function takes only two main parameters, the
first specifying a departure station, and the second specifying a pair of
`start_time_limits` determining the earliest and latest possible departure
times from that station. The function will then return the fastest connections
to all possible end stations for services departing within the specified
`start_time_limits`.

For example, travel times to all stations for services leaving a nominated
station between 12:00 and 13:00 can be extracted by specifying
`start_time_limits = c (12, 13) * 3600`. The following code uses the internal
Berlin data to demonstrate, starting by reading the GTFS data and constructing
a timetable for a specified day.  This second step, achieved by calling the
[`gtfs_timetable()`
function](https://UrbanAnalyst.github.io/gtfsrouter/reference/gtfs_timetable.html),
is necessary prior to calculating travel times.

```{r berlin_gtfs}
library (gtfsrouter)
berlin_gtfs_to_zip ()
f <- file.path (tempdir (), "vbb.zip")
gtfs <- extract_gtfs (f, quiet = TRUE)
gtfs <- gtfs_timetable (gtfs, day = 3)
```

The 
[`gtfs_traveltimes()`](https://UrbanAnalyst.github.io/gtfsrouter/reference/gtfs_traveltimes.html),
function may then be called by specifying a start station (`from`), and
`start_time_limits`, as in the following code:

```{r traveltimes}
from <- "Alexanderplatz"
start_time_limits <- c (12, 13) * 3600
tt <- gtfs_traveltimes (
    gtfs,
    from = from,
    start_time_limits = start_time_limits
)
head (tt)
```

The function returns a `data.frame` in which each row details the fastest
connection to each station, in terms of the start time and duration of that
trip, as well as the number of transfers necessary.

## Maximum Traveltimes

The previous code returns a `data.frame` detailing traveltimes for the following stations:

```{r}
nrow (tt)
```

This relatively low number is because the [`berlin_gtfs_to_zip()`
function](https://UrbanAnalyst.github.io/gtfsrouter/reference/berlin_gtfs_to_zip.html)
creates only a small sample portion of a full feed, extending over only one
hour and containing a small number of stations. The total number of stations
reachable from the `r from` station during these times are:

```{r}
nrow (gtfs$stops)
```

The
[`gtfs_traveltimes()`](https://UrbanAnalyst.github.io/gtfsrouter/reference/gtfs_traveltimes.html),
returns travel times to only a subset of all potentially reachable stations 
(`r nrow (tt)` instead of `r nrow (gtfs$stops)`) for two main reasons. First,
not all stations may actually be reachable from a given station, and travel
times will only be returned for those that are. Secondly, the
[`gtfs_traveltimes()`](https://UrbanAnalyst.github.io/gtfsrouter/reference/gtfs_traveltimes.html),
function has an additional parameter, `max_traveltime`, and only returns travel
times to stations able to be reached within this upper limit. The default value
is 3600, or one hour. For the token feed used here, the maximum travel times
returned are:

```{r maxtt}
library (hms)
hms (as.integer (max (tt$duration)))
```

In this case that value actually reflects the restricted data included in the
sample data set, but applying the function to an actual GTFS feed will by
default record traveltimes to all stations reachable within one hour. The
following shows a more realistic example using a full version of the Berlin
GTFS data:

```{r get-vbb, eval = FALSE}
gtfs <- extract_gtfs ("/<path>/<to>/vbb.zip")
gtfs <- gtfs_timetable (gtfs, day = 3)
tt <- gtfs_traveltimes (
    gtfs,
    from = from,
    start_time_limits = start_time_limits
)
nrow (tt)
hms::hms (as.integer (max (tt$duration)))
## [1] 8556
## 01:00:00
```

The number of stations reached in the full feed is far more, and the maximum
trip duration is indeed precisely one hour. This parameter of `max_traveltime`
defaults to the relatively short value of one hour because calculation times
increase non-linearly with numbers of stations reached. The following graphs
show calculation times as a function both of `max_traveltime`, and numbers of
stations reached.

<details>
<summary>Click one the arrow to the left to see code used to generate these results</summary>
<p>

```{r timing1, eval = FALSE}
maxt <- 3600 + 0:10 * 1800 # 1-6 hours in half-hour intervals
dat <- vapply (
    maxt, function (i) {
        st <- system.time (
            res <- gtfs_traveltimes (
                gtfs,
                from = from,
                start_time_limits = start_time_limits,
                max_traveltime = i
            )
        )
        return (c (st [3], nrow (res))) },
    numeric (2)
)
dat <- data.frame (
    max_time = maxt / 3600, # in hours
    calc_time = dat [1, ],
    n_stns = dat [2, ] / 1000
)
par (mfrow = c (1, 2))
plot (dat$max_time, dat$calc_time,
    pch = 19, col = "gray",
    xlab = "Max Traveltime (hours)",
    ylab = "Calculation Time (seconds)"
)
lines (dat$max_time, dat$calc_time)
plot (dat$n_stns, dat$calc_time,
    pch = 19, col = "gray",
    xlab = "Thousands of Stations Reached",
    ylab = "Calculation Time (seconds)"
)
lines (dat$n_stns, dat$calc_time)
```
</p>
</details>
<br>

```{r timing-manual, echo = FALSE}
maxt <- 3600 + 0:10 * 1800
calc_time <- c (
    0.914, 1.246, 1.911, 2.156, 2.333, 2.513, 2.889, 3.344, 3.784,
    4.200, 4.661
)
n_stns <- c (
    8556, 12530, 15989, 19364, 21752, 23628, 24628, 25004, 25191,
    25295, 25352
)
dat <- data.frame (
    max_time = maxt / 3600, # in hours
    calc_time = calc_time,
    n_stns = n_stns / 1000
)
par (mfrow = c (1, 2))
plot (dat$max_time, dat$calc_time,
    pch = 19, col = "gray",
    xlab = "Max Traveltime (hours)",
    ylab = "Calculation Time (seconds)"
)
lines (dat$max_time, dat$calc_time)
plot (dat$n_stns, dat$calc_time,
    pch = 19, col = "gray",
    xlab = "Thousands of Stations Reached",
    ylab = "Calculation Time (seconds)"
)
lines (dat$n_stns, dat$calc_time)
```

Those graphs show that increasing the `max_traveltime` parameter leads to
approximately linear increases in the calculation times required for 
[`gtfs_traveltimes()`](https://UrbanAnalyst.github.io/gtfsrouter/reference/gtfs_traveltimes.html),
to execute, while relationships with numbers of stations actually reached are
highly non-linear. The panel on the right side shows that increases in
`max_traveltime` eventually have little effect on increasing numbers of
stations reached, yet increases computation times considerably. Values for
`max_traveltime` should accordingly only be adjusted after first determining
appropriate values. In particular, note that although the Berlin GTFS has
a large number of distinct stops:

```{r vbb-stops, eval = FALSE}
nrow (gtfs$stops)
## [1] 41577
```

Many of these reflect `stop_id` values for multiple platforms at a single station. The number of distinct stop names is in fact much less:

```{r vbb-stop-names, eval = FALSE}
length (unique (gtfs$stops$stop_name))
## [1] 13090
```

A rule-of-thumb is to increase the `max_traveltime` parameter until the number
of stations reached exceeds by some small amount the total number of actual
stops in a system. In the case of these Berlin data, a maximum traveltime of
2 hours is sufficient.

```{r vbb-2hours, eval = FALSE}
nrow (gtfs_traveltimes (
    gtfs,
    from = from,
    start_time_limits = start_time_limits,
    max_traveltime = 7200
))
## [1] 15989
```

These additional stations reached naturally tend to be stations on the
periphery of the system, while doubling the `max_traveltime` parameter
approximately doubles the calculation time (in this case, according to the
graphs shown above). Appropriate values will depend on the nature of desired
results. Analyses focussed on more central parts of a system can use smaller
values, with shorter corresponding calculation times. Analyses for which
peripheral stations are important may need to use larger values of
`max_traveltimes`, yet should do so carefully to avoid undue increases in
calculation times.

## Fastest Routes versus Minimal-Transfer Routes

The
[`gtfs_traveltimes()`](https://UrbanAnalyst.github.io/gtfsrouter/reference/gtfs_traveltimes.html),
function has an additional parameter, `minimise_transfers`, which can be used
to return either the fastest possible connections from a start station to all
other reachable stations within a system, or the fastest connection which has
the least possible number of transfers. Connections with the least transfers
may be slower than fastest possible connections, as shown by the following
analysis.

```{r min-transfers, eval = FALSE}
tt_fastest <- gtfs_traveltimes (
    gtfs,
    from = from,
    start_time_limits = start_time_limits
)
tt_min_tr <- gtfs_traveltimes (
    gtfs,
    from = from,
    start_time_limits = start_time_limits,
    minimise_transfers = TRUE
)
# non-dplyr join:
tt_fastest <- tt_fastest [tt_fastest$stop_id %in% tt_min_tr$stop_id, ]
tt_min_tr <- tt_min_tr [tt_min_tr$stop_id %in% tt_fastest$stop_id, ]
dat <- data.frame (
    stop_id = tt_fastest$stop_id,
    fastest_dur = as.numeric (tt_fastest$duration / 3600), # hours
    fastest_ntr = tt_fastest$ntransfers,
    min_tr_dur = as.numeric (tt_min_tr$duration / 3600),
    min_tr_ntr = tt_min_tr$ntransfers
)

60 * mean (dat$min_tr_dur - dat$fastest_dur) # in minutes
## [1] 3.957052
```
```{r ntr-diff, eval = FALSE}
mean (dat$fastest_ntr - dat$min_tr_ntr)
## [1] 0.2818428
```

Minimal-transfer trips take on average just under 4 minutes longer, while
requiring 0.28 fewer trips. Most fastest trips are nevertheless also
minimal-transfer trips, as can be seen by the number of identical values
returned by these two queries.

```{r ntr-prop, eval = FALSE}
length (which (dat$min_tr_ntr == dat$fastest_ntr)) / nrow (dat)
## [1] 0.6875221
```

So almost 70% of faster trips are also trips with fewest number of transfers,
with around 30% of fastest trips involving more than the minimal possible
number of transfers.

## The Traveltimes Algorithm

The 
[`gtfs_traveltimes()`](https://UrbanAnalyst.github.io/gtfsrouter/reference/gtfs_traveltimes.html),
is based on a new algorithm specifically developed for this package, and which
will be described here in further detail in subsequent releases of the package.

```{r DTthread-reset, echo = FALSE}
data.table::setDTthreads (nthr)
```