Many quantities in epidemiology can be thought of as the time between two events or “delays”. Important examples include:
- the incubation period (time between infection and symptom onset),
- serial interval (time between symptom onset of the infector and symptom onset of the infectee), and
- generation interval (time between infection of the infector and infection of the infectee).
We refer to all of these delays as the time between a “primary event” and a “secondary event”. Unfortunately, estimating delays accurately from observational data is usually difficult. In our experience, the two main issues are:
- interval censoring, and
- right truncation.
Don’t worry if you’ve not come across these terms before.
First, in Section 1, we explain interval censoring and right truncation by simulating data like that we might observe during an ongoing infectious disease outbreak.
Then, in Section 2, we show how epidist can be used to accurately estimate delay distributions by using a statistical model which properly accounts for these two issues.
If you would like more technical details, the epidist package implements models following best practices as described in Park et al. (2024) and Charniga et al. (2024).
We also recommend the introduction provided by the nowcasting and forecasting infectious disease dynamics course.
To run this vignette yourself, you will need the epidist package as well as the following packages:
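The setup code itself is not reproduced here; based on the functions used below, a setup chunk along the following lines should be sufficient, though the exact list of packages is an assumption rather than a definitive requirement:
# Assumed setup, inferred from the functions used in this vignette
library(epidist)   # simulate_gillespie(), epidist(), and related helpers
library(dplyr)     # filter(), mutate(), slice_sample(), bind_rows()
library(ggplot2)   # plots
library(gt)        # formatted tables
library(brms)      # post-processing of the fitted brmsfit object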
1 Example data
First, we use the Gillespie algorithm to generate infectious disease outbreak data (Figure 1.1) from a stochastic compartmental model.
outbreak <- simulate_gillespie(seed = 101)
outbreak |>
filter(case %% 50 == 0) |>
ggplot(aes(x = ptime, y = case)) +
geom_point(col = "#56B4E9") +
labs(x = "Primary event time (day)", y = "Case number") +
theme_minimal()
outbreak is a data.frame with two columns: case and ptime.
Here ptime is a numeric column giving the time of infection.
In reality, it is more common to receive primary event times as a date rather than a numeric.
head(outbreak)
#> case ptime
#> 1 1 0.04884052
#> 2 2 0.06583120
#> 3 3 0.21857827
#> 4 4 0.24963421
#> 5 5 0.30133392
#> 6 6 0.31425010
To generate secondary events, we will use a lognormal distribution (Figure 1.2) for the delay between primary and secondary events:
secondary_dist <- data.frame(mu = 1.8, sigma = 0.5)
class(secondary_dist) <- c("lognormal_samples", class(secondary_dist))
secondary_dist <- add_mean_sd(secondary_dist)
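As a quick sanity check on these parameters (not part of the original code), recall that the mean of a lognormal distribution with meanlog mu and sdlog sigma is exp(mu + sigma^2 / 2), so the delays simulated below should average roughly 6.9 days:
# Mean delay implied by meanlog = 1.8 and sdlog = 0.5
exp(secondary_dist[["mu"]] + secondary_dist[["sigma"]]^2 / 2)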
obs <- outbreak |>
simulate_secondary(
meanlog = secondary_dist[["mu"]],
sdlog = secondary_dist[["sigma"]]
)
ggplot(data.frame(x = c(0, 30)), aes(x = x)) +
geom_function(
fun = dlnorm,
args = list(
meanlog = secondary_dist[["mu"]],
sdlog = secondary_dist[["sigma"]]
)
) +
theme_minimal() +
labs(
x = "Delay between primary and secondary event (days)",
y = "Probability density"
)
obs |>
filter(case %% 50 == 0) |>
ggplot(aes(y = case)) +
geom_segment(
aes(x = ptime, xend = stime, y = case, yend = case),
col = "grey"
) +
geom_point(aes(x = ptime), col = "#56B4E9") +
geom_point(aes(x = stime), col = "#009E73") +
labs(x = "Event time (day)", y = "Case number") +
theme_minimal()
obs is now a data.frame with further columns for delay and stime.
The secondary event time is simply the primary event time plus the delay:
all(obs$ptime + obs$delay == obs$stime)
#> [1] TRUE
If we were to receive the complete data obs as above, then it would be simple to accurately estimate the delay distribution.
However, in reality, during an outbreak we almost never receive the data as above.
First, the times of primary and secondary events will usually be censored. This means that rather than exact event times, we observe event times within an interval. Here we suppose that the interval is daily, meaning that only the date of the primary or secondary event, not the exact event time, is reported (Figure 1.4):
obs_cens <- obs |>
mutate(
ptime_lwr = floor(.data$ptime),
ptime_upr = .data$ptime_lwr + 1,
stime_lwr = floor(.data$stime),
stime_upr = .data$stime_lwr + 1,
delay_daily = stime_lwr - ptime_lwr
)
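Because both event times are rounded down independently, the daily delay can differ from the true (continuous) delay by up to one day in either direction. A quick check of this, not in the original vignette, is:
# The difference lies strictly between -1 and 1 day
summary(obs_cens$delay - obs_cens$delay_daily)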
obs_cens |>
filter(case %% 50 == 0, case <= 250) |>
ggplot(aes(y = case)) +
geom_segment(
aes(x = ptime, xend = stime, y = case, yend = case),
col = "grey"
) +
# The primary event censoring intervals
geom_errorbarh(
aes(xmin = ptime_lwr, xmax = ptime_upr, y = case),
col = "#56B4E9", height = 5
) +
# The secondary event censoring intervals
geom_errorbarh(
aes(xmin = stime_lwr, xmax = stime_upr, y = case),
col = "#009E73", height = 5
) +
geom_point(aes(x = ptime), fill = "white", col = "#56B4E9", shape = 21) +
geom_point(aes(x = stime), fill = "white", col = "#009E73", shape = 21) +
labs(x = "Event time (day)", y = "Case number") +
theme_minimal()
During an outbreak we will usually be estimating delays in real time. The result is that only those cases with a secondary event occurring before some time will be observed. This is called (right) truncation, and biases the observation process towards shorter delays:
obs_time <- 25
obs_cens_trunc <- obs_cens |>
mutate(obs_time = obs_time) |>
filter(.data$stime_upr <= .data$obs_time)
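To see the effect of truncation directly, we can compare the empirical mean of the daily delays before and after truncation (a quick check, not part of the original analysis); the truncated mean should be noticeably smaller:
# Right truncation preferentially removes cases with long delays
mean(obs_cens$delay_daily)
mean(obs_cens_trunc$delay_daily)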
Finally, in reality, it’s not possible to observe every case.
We suppose that a sample of individuals of size sample_size is observed:
sample_size <- 200
This sample size corresponds to 8.7% of the data.
obs_cens_trunc_samp <- obs_cens_trunc |>
slice_sample(n = sample_size, replace = FALSE)
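Assuming the quoted percentage refers to the censored and truncated data, it can be recomputed directly:
# Percentage of the censored, truncated data that is sampled
round(100 * sample_size / nrow(obs_cens_trunc), 1)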
Another issue, which epidist currently does not account for, is that sometimes only the secondary event might be observed, and not the primary event.
For example, symptom onset may be reported, but start of infection unknown.
Discarding events of this type leads to what are called ascertainment biases.
Whereas each case is equally likely to appear in the sample above, under ascertainment bias some cases are more likely to appear in the data than others.
With our censored, truncated, and sampled data, we are now ready to try to recover the underlying delay distribution using epidist.
2 Fit the model and compare estimates
If we had access to the complete and unaltered obs, it would be simple to estimate the delay distribution.
However, with access only to obs_cens_trunc_samp, the delay distribution we observe is biased (Figure 2.1) and we must use a statistical model.
bind_rows(
obs_cens |>
mutate(type = "Complete, retrospective data") |>
select(delay = delay_daily, type),
obs_cens_trunc_samp |>
mutate(type = "Censored, truncated,\nsampled data") |>
select(delay = delay_daily, type)
) |>
group_by(type, delay, .drop = FALSE) |>
summarise(n = n()) |>
mutate(p = n / sum(n)) |>
ggplot() +
geom_col(
aes(x = delay, y = p, fill = type, group = type),
position = position_dodge2(preserve = "single")
) +
scale_fill_manual(values = c("#CC79A7", "#0072B2")) +
geom_function(
data = data.frame(x = c(0, 30)), aes(x = x),
fun = dlnorm,
args = list(
meanlog = secondary_dist[["mu"]],
sdlog = secondary_dist[["sigma"]]
)
) +
labs(
x = "Delay between primary and secondary event (days)",
y = "Probability density",
fill = ""
) +
theme_minimal() +
theme(legend.position = "bottom")
The main function you will use for modelling is called epidist().
We will fit the model "epidist_latent_model", which uses latent variables for the times of the primary and secondary events of each individual.
To do so, we first convert the observations to a linelist format with as_epidist_linelist_data() and then prepare them for the latent model using as_epidist_latent_model():
linelist_data <- as_epidist_linelist_data(
obs_cens_trunc_samp$ptime_lwr,
obs_cens_trunc_samp$ptime_upr,
obs_cens_trunc_samp$stime_lwr,
obs_cens_trunc_samp$stime_upr,
obs_time = obs_cens_trunc_samp$obs_time
)
data <- as_epidist_latent_model(linelist_data)
class(data)
#> [1] "epidist_latent_model" "epidist_linelist_data" "tbl_df"
#> [4] "tbl" "data.frame"
The data object now has the class epidist_latent_model.
Using this data, we now call epidist() to fit the model.
The parameters of the model are inferred using Bayesian inference.
In particular, we use the No-U-Turn Sampler (NUTS) Markov chain Monte Carlo (MCMC) algorithm via the brms R package (Bürkner 2017).
fit <- epidist(
data = data, chains = 2, cores = 2, refresh = as.integer(interactive())
)
Note that here we use the default rstan backend, but we generally recommend using the cmdstanr backend for faster sampling and additional features.
This can be set using backend = "cmdstanr" after following the CmdStan installation instructions in the README.
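For example, assuming CmdStan is installed, the equivalent call would be (fit_cmdstanr is just an illustrative name):
# Same model as above, fitted with the cmdstanr backend
fit_cmdstanr <- epidist(
data = data, chains = 2, cores = 2, backend = "cmdstanr"
)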
The fit object is a brmsfit object containing MCMC samples of each of the parameters in the model, shown in the table below.
Many of these parameters (e.g. swindow and pwindow) are the so-called latent variables, and have lengths corresponding to sample_size.
pars <- fit |>
variables(reserved = FALSE) |>
gsub(pattern = "\\[\\d+\\]", replacement = "")
data.frame(
Parameter = unique(pars),
Length = as.integer(table(pars)[unique(pars)])
) |>
gt()
Parameter | Length
---|---
b_Intercept | 1
b_sigma_Intercept | 1
Intercept | 1
Intercept_sigma | 1
lprior | 1
lp__ | 1
swindow_raw | 200
pwindow_raw | 200
pwindow | 200
swindow | 200
Users familiar with Stan and brms can work with fit directly.
Any tool that supports brms fitted model objects will be compatible with fit.
For example, we can use the built-in summary() function
summary(fit)
#> Family: latent_lognormal
#> Links: mu = identity; sigma = log
#> Formula: delay | vreal(relative_obs_time, pwindow, swindow) ~ 1
#> sigma ~ 1
#> Data: data (Number of observations: 200)
#> Draws: 2 chains, each with iter = 2000; warmup = 1000; thin = 1;
#> total post-warmup draws = 2000
#>
#> Regression Coefficients:
#> Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
#> Intercept 1.75 0.05 1.66 1.85 1.00 2474 1373
#> sigma_Intercept -0.77 0.07 -0.90 -0.63 1.00 2071 1326
#>
#> Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
#> and Tail_ESS are effective sample size measures, and Rhat is the potential
#> scale reduction factor on split chains (at convergence, Rhat = 1).
to see that the model has converged and recovered the delay distribution parameters reasonably well.
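Other tools that work with brmsfit objects should behave in the same way. For example (not shown in the original vignette), the population-level coefficients and the raw posterior draws can be extracted with:
# Population-level (fixed effect) estimates
brms::fixef(fit)
# Posterior draws via the posterior package
posterior::as_draws_df(fit)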
The epidist package also provides functions to make common post-processing tasks easy.
For example, individual predictions of the lognormal delay parameters can be extracted using:
pred <- predict_delay_parameters(fit)
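The resulting object can be inspected in the usual way; the columns mu and sigma are what we use below to plot the fitted distributions:
# Inspect the predicted lognormal parameters
head(pred)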
Figure 2.2 shows the lognormal delay distribution obtained for 100 draws from the posterior distribution of the mu and sigma parameters.
Whereas in Figure 2.1 the histogram of the censored, truncated, sampled data was substantially different from the underlying delay distribution, using epidist() we have obtained a much closer match to the truth.
# Sample 100 random draws from the posterior
set.seed(123)
sample_draws <- sample(nrow(pred), 100)
ggplot() +
# Plot the true distribution
geom_function(
data = data.frame(x = c(0, 30)),
aes(x = x),
fun = dlnorm,
linewidth = 1.5,
args = list(
meanlog = secondary_dist[["mu"]],
sdlog = secondary_dist[["sigma"]]
)
) +
# Plot 100 sampled posterior distributions
lapply(sample_draws, \(i) {
geom_function(
data = data.frame(x = c(0, 30)),
aes(x = x),
fun = dlnorm,
args = list(
meanlog = pred$mu[i],
sdlog = pred$sigma[i]
),
alpha = 0.1,
color = "#CC79A7"
)
}) +
labs(
x = "Delay between primary and secondary event (days)",
y = "Probability density"
) +
theme_minimal()