Skip to contents

Many quantities in epidemiology can be thought of as the time between two events or “delays”. Important examples include:

  • the incubation period (time between infection and symptom onset),
  • serial interval (time between symptom onset of infectee and symptom onset of infected), and
  • generation interval (time between infection of infectee and infection of infected).

We encompass all of these delays as the time between a “primary event” and “secondary event”. Unfortunately, estimating delays accurately from observational data is usually difficult. In our experience, the two main issues are:

  1. interval censoring, and
  2. right truncation.

Don’t worry if you’ve not come across these terms before. First, in Section 1, we explain interval censoring and right truncation by simulating data like that we might observe during an ongoing infectious disease outbreak. Then, in Section 2, we show how epidist can be used to accurately estimate delay distributions by using a statistical model which properly accounts for these two issues.

If you would like more technical details, the epidist package implements models following best practices as described in Park et al. (2024) and Charniga et al. (2024).

To run this vignette yourself, along with the epidist package, you will need the following packages:

1 Example data

Data should be formatted as a data.frame with the following columns for use within epidist:

  • case: The unique case ID.
  • ptime: The time of the primary event.
  • stime: The time of the secondary event.

Here we simulate data in this format, and in doing so explain the two main issues with observational delay data.

First, we use the Gillepsie algorithm to generate infectious disease outbreak data (Figure 1.1) from a stochastic compartmental model:

outbreak <- simulate_gillespie(seed = 101)
outbreak |>
  filter(case %% 50 == 0) |>
  ggplot(aes(x = ptime, y = case)) +
  geom_point(col = "#56B4E9") +
  labs(x = "Primary event time (day)", y = "Case number") +
  theme_minimal()
Early on in the epidemic, there is a high rate of growth in new cases. As more people are infected, the rate of growth slows. (Only every 50th case is shown to avoid over-plotting.)

Figure 1.1: Early on in the epidemic, there is a high rate of growth in new cases. As more people are infected, the rate of growth slows. (Only every 50th case is shown to avoid over-plotting.)

outbreak is a data.frame with the columns case and ptime.

Now, to generate secondary events, we will use a lognormal distribution (Figure 1.2) for the delay between primary and secondary events:

secondary_dist <- data.frame(mu = 1.8, sigma = 0.5)
class(secondary_dist) <- c("lognormal_samples", class(secondary_dist))
secondary_dist <- add_mean_sd(secondary_dist)
ggplot(data.frame(x = c(0, 30)), aes(x = x)) +
  geom_function(
    fun = dlnorm,
    args = list(
      meanlog = secondary_dist[["mu"]],
      sdlog = secondary_dist[["sigma"]]
    )
  ) +
  theme_minimal() +
  labs(
    x = "Delay between primary and secondary event (days)",
    y = "Probability density"
  )
The lognormal distribution is skewed to the right. Long delay times still have some probability.

Figure 1.2: The lognormal distribution is skewed to the right. Long delay times still have some probability.

obs <- outbreak |>
  simulate_secondary(
    meanlog = secondary_dist[["mu"]],
    sdlog = secondary_dist[["sigma"]]
  )
obs |>
  filter(case %% 50 == 0) |>
  ggplot(aes(y = case)) +
  geom_segment(
    aes(x = ptime, xend = stime, y = case, yend = case),
    col = "grey"
  ) +
  geom_point(aes(x = ptime), col = "#56B4E9") +
  geom_point(aes(x = stime), col = "#009E73") +
  labs(x = "Event time (day)", y = "Case number") +
  theme_minimal()
Secondary events (in green) occur with a delay drawn from the lognormal distribution (Figure 1.2). As with Figure 1.1, to make this figure easier to read, only every 50th case is shown.

Figure 1.3: Secondary events (in green) occur with a delay drawn from the lognormal distribution (Figure 1.2). As with Figure 1.1, to make this figure easier to read, only every 50th case is shown.

obs is now a data.frame with further columns for delay and stime. The secondary event time is simply the primary event time plus the delay:

all(obs$ptime + obs$delay == obs$stime)
#> [1] TRUE

If we were to receive the data obs as above then estimating the delay distribution would be easy, and the epidist package wouldn’t need to exist. However, in reality, during an outbreak we almost never receive the data as above.

First, the times of primary and secondary events will usually be censored. This means that rather than exact event times, we observe event times within an interval. Here we suppose that the interval is daily, meaning that only the date of the primary or secondary event, not the exact event time, is reported (Figure 1.4):

obs_cens <- obs |>
  mutate(
    ptime_lwr = floor(.data$ptime),
    ptime_upr = .data$ptime_lwr + 1,
    stime_lwr = floor(.data$stime),
    stime_upr = .data$stime_lwr + 1,
    delay_daily = stime_lwr - ptime_lwr
  )
ggplot(obs_cens, aes(x = delay, y = delay_daily)) +
  geom_point(col = "#E69F00") +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", col = "grey") +
  theme_minimal() +
  coord_fixed() +
  labs(x = "Exact delay time (days)", y = "Censored delay time (days)")
Interval censoring of the primary and secondary event times obscures the delay times. A common example of this is when events are reported as daily aggregates. While daily censoring is most common, epidist supports the primary and secondary events having other delay intervals.

Figure 1.4: Interval censoring of the primary and secondary event times obscures the delay times. A common example of this is when events are reported as daily aggregates. While daily censoring is most common, epidist supports the primary and secondary events having other delay intervals.

Next, during an outbreak we will usually be estimating delays in real time. The result is that only those cases with a secondary event occurring before some time will be observed. This is called (right) truncation, and biases the observation process towards shorter delays:

obs_time <- 25
obs_cens_trunc <- obs_cens |>
  mutate(obs_time = obs_time) |>
  filter(.data$stime_upr <= .data$obs_time)

Finally, in reality, it’s not possible to observe every case. We suppose that a sample of individuals of size sample_size are observed:

sample_size <- 200

This sample size corresponds to 8.7% of the data.

obs_cens_trunc_samp <- obs_cens_trunc |>
  slice_sample(n = sample_size, replace = FALSE)

Another issue, which epidist currently does not account for, is that sometimes only the secondary event might be observed, and not the primary event. For example, symptom onset may be reported, but start of infection unknown. Discarding events of this type leads to what are called ascertainment biases. Whereas each case is equally likely to appear in the sample above, under ascertainment bias some cases are more likely to appear in the data than others.

With our censored, truncated, and sampled data, we are now ready to try to recover the underlying delay distribution using epidist.

2 Fit the model and compare estimates

If we had access to the complete and unaltered obs, it would be simple to estimate the delay distribution. However, with only access to obs_cens_trunc_samp, the delay distribution we observe is biased (Figure 2.1) and we must use a statistical model.

bind_rows(
  obs_cens |>
    mutate(type = "Complete, retrospective data") |>
    select(delay = delay_daily, type),
  obs_cens_trunc_samp |>
    mutate(type = "Censored, truncated,\nsampled data") |>
    select(delay = delay_daily, type)
) |>
  group_by(type, delay, .drop = FALSE) |>
  summarise(n = n()) |>
  mutate(p = n / sum(n)) |>
  ggplot() +
  geom_col(
    aes(x = delay, y = p, fill = type, group = type),
    position = position_dodge2(preserve = "single")
  ) +
  scale_fill_manual(values = c("#CC79A7", "#0072B2")) +
  geom_function(
    data = data.frame(x = c(0, 30)), aes(x = x),
    fun = dlnorm,
    args = list(
      meanlog = secondary_dist[["mu"]],
      sdlog = secondary_dist[["sigma"]]
    )
  ) +
  labs(
    x = "Delay between primary and secondary event (days)",
    y = "Probability density",
    fill = ""
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")
The histogram of delays from the complete, retrospective data obs_cens match quite closely with the underlying distribution, whereas those from obs_cens_trunc_samp show more significant systematic bias. In this instance the extent of the bias caused by censoring is less than that caused by right truncation. Nonetheless, we always recommend [Charniga et al. (2024); Table 2] adjusting for censoring when it is present.

Figure 2.1: The histogram of delays from the complete, retrospective data obs_cens match quite closely with the underlying distribution, whereas those from obs_cens_trunc_samp show more significant systematic bias. In this instance the extent of the bias caused by censoring is less than that caused by right truncation. Nonetheless, we always recommend [Charniga et al. (2024); Table 2] adjusting for censoring when it is present.

The main function you will use for modelling is called epidist()1. We will fit the model "epidist_latent_model" which uses latent variables for the time of primary and secondary event of each individual2. To do so, we first prepare the data using as_epidist_latent_model():

linelist_data <- as_epidist_linelist_data(
  obs_cens_trunc_samp$ptime_lwr,
  obs_cens_trunc_samp$ptime_upr,
  obs_cens_trunc_samp$stime_lwr,
  obs_cens_trunc_samp$stime_upr,
  obs_time = obs_cens_trunc_samp$obs_time
)

data <- as_epidist_latent_model(linelist_data)
class(data)
#> [1] "epidist_latent_model"  "epidist_linelist_data" "tbl_df"               
#> [4] "tbl"                   "data.frame"

The data object now has the class epidist_latent_model. Using this data, we now call epidist() to fit the model. The parameters of the model are inferred using Bayesian inference. In particular, we use the the No-U-Turn Sampler (NUTS) Markov chain Monte Carlo (MCMC) algorithm via the brms R package (Bürkner 2017).

fit <- epidist(data = data, chains = 2, cores = 2, refresh = 0)

The fit object is a brmsfit object containing MCMC samples from each of the parameters (Table ??) in the model. Users familiar with Stan and brms, can work with fit directly. Any tool that supports brms fitted model objects will be compatible with fit.

Note that here we use the default rstan backend but we generally recommend using the cmdstanr backend for faster sampling and additional features. This can be set using backend = "cmdstanr" after following the installing cmdstan instructions in the README.

pars <- fit |>
  variables(reserved = FALSE) |>
  gsub(pattern = "\\[\\d+\\]", replacement = "")

data.frame(
  Parameter = unique(pars), Length = table(pars)
) |>
  gt() |>
  tab_caption("All of the parameters that are included in the model. Many of these parameters (e.g. swindow and pwindow) are the so called latent variables in the model, and have lengths corresponding to the sample_size. We extracted the model parameters using brms::variables() and removed the indices.")
(\#tab:pars)All of the parameters that are included in the model. Many of these parameters (e.g. swindow and pwindow) are the so called latent variables in the model, and have lengths corresponding to the sample_size. We extracted the model parameters using brms::variables() and removed the indices.
Parameter Length.pars Length.Freq
b_Intercept b_Intercept 1
b_sigma_Intercept b_sigma_Intercept 1
Intercept Intercept 1
Intercept_sigma Intercept_sigma 1
lprior lp__ 1
lp__ lprior 1
swindow_raw pwindow 200
pwindow_raw pwindow_raw 200
pwindow swindow 200
swindow swindow_raw 200

The epidist package also provides functions to make common post-processing tasks easy. For example, individual predictions of the lognormal delay parameters can be extracted using:

Figure 2.2 shows the lognormal delay distribution obtained using the average of the mu and sigma draws. Whereas in Figure 2.1 the histogram of censored, truncated, sampled data was substantially different to the underlying delay distribution, using epidist() we have obtained a much closer match to the truth.

ggplot() +
  geom_function(
    data = data.frame(x = c(0, 30)),
    aes(x = x),
    fun = dlnorm,
    args = list(
      meanlog = secondary_dist[["mu"]],
      sdlog = secondary_dist[["sigma"]]
    )
  ) +
  geom_function(
    data = data.frame(x = c(0, 30)),
    aes(x = x), fun = dlnorm,
    args = list(
      meanlog = mean(pred$mu),
      sdlog = mean(pred$sigma)
    ),
    col = "#CC79A7"
  ) +
  labs(
    x = "Delay between primary and secondary event (days)",
    y = "Probability density"
  ) +
  theme_minimal()
A fitted delay distribution (in pink) as compared with the true delay distribution (in black).

Figure 2.2: A fitted delay distribution (in pink) as compared with the true delay distribution (in black).

Bibliography

Bürkner, Paul-Christian. 2017. brms: An R Package for Bayesian Multilevel Models Using Stan.” Journal of Statistical Software 80 (1): 1–28. https://doi.org/10.18637/jss.v080.i01.
Charniga, Kelly, Sang Woo Park, Andrei R. Akhmetzhanov, Anne Cori, Jonathan Dushoff, Sebastian Funk, Katelyn M. Gostic, et al. 2024. “Best Practices for Estimating and Reporting Epidemiological Delay Distributions of Infectious Diseases.” PLOS Computational Biology 20 (10): 1–21. https://doi.org/10.1371/journal.pcbi.1012520.
Park, Sang Woo, Andrei R. Akhmetzhanov, Kelly Charniga, Anne Cori, Nicholas G. Davies, Jonathan Dushoff, Sebastian Funk, et al. 2024. “Estimating Epidemiological Delay Distributions for Infectious Diseases.” medRxiv. https://doi.org/10.1101/2024.01.12.24301247.