Frequently asked questions and tips

Here we provide tips for working with the epidist package, and answers to frequently asked questions. If you have a question about using the package, please create an issue. We will endeavour to respond promptly!

Click to expand for code to reproduce the examples in this vignette

library(epidist)
library(brms)
library(dplyr)
library(ggplot2)
library(scales)
library(tidyr)

set.seed(1)

# Define different parameters for each location
meanlog_0 <- 1.6
sdlog_0 <- 0.4
meanlog_1 <- 2.1
sdlog_1 <- 0.6

obs_time <- 25
sample_size <- 200

# Create base data with location assignments
base_data <- simulate_gillespie(seed = 101)
base_data$location <- rbinom(n = nrow(base_data), size = 1, prob = 0.5)

# Create location 0 data
location_0_data <- base_data |>
  filter(location == 0) |>
  simulate_secondary(
    meanlog = meanlog_0,
    sdlog = sdlog_0
  ) |>
  select(case, ptime, delay, stime, location)

# Create location 1 data
location_1_data <- base_data |>
  filter(location == 1) |>
  simulate_secondary(
    meanlog = meanlog_1,
    sdlog = sdlog_1
  ) |>
  select(case, ptime, delay, stime, location)

# Combine datasets
obs_cens_trunc_samp <- bind_rows(location_0_data, location_1_data) |>
  mutate(
    ptime_lwr = floor(.data$ptime),
    ptime_upr = .data$ptime_lwr + 1,
    stime_lwr = floor(.data$stime),
    stime_upr = .data$stime_lwr + 1,
    obs_time = obs_time
  ) |>
  filter(.data$stime_upr <= .data$obs_time) |>
  slice_sample(n = sample_size, replace = FALSE)

linelist_data <- as_epidist_linelist_data(
  obs_cens_trunc_samp$ptime_lwr,
  obs_cens_trunc_samp$ptime_upr,
  obs_cens_trunc_samp$stime_lwr,
  obs_cens_trunc_samp$stime_upr,
  obs_time = obs_cens_trunc_samp$obs_time,
  covariates = data.frame(location = obs_cens_trunc_samp$location)
)
data <- as_epidist_marginal_model(linelist_data)

fit <- epidist(
  data,
  formula = mu ~ 1,
  seed = 1,
  chains = 2,
  cores = 2,
  refresh = ifelse(interactive(), 250, 0),
  iter = 1000,
  backend = "cmdstanr"
)

fit_location <- epidist(
  data,
  formula = bf(mu ~ location, sigma ~ location),
  seed = 1,
  chains = 2,
  cores = 2,
  refresh = ifelse(interactive(), 250, 0),
  iter = 1000,
  backend = "cmdstanr"
)

Working with posterior samples

The output of a call to epidist is compatible with typical Stan workflows. We recommend use of the posterior package for working with samples.

Getting a `data.frame` of posterior samples

The function posterior::as_draws_df() may be used to obtain a dataframe of MCMC draws for specified parameters.

library(posterior)
draws <- as_draws_df(fit, variable = c("Intercept", "Intercept_sigma"))
head(draws)

## # A draws_df: 6 iterations, 1 chains, and 2 variables
##   Intercept Intercept_sigma
## 1       1.8           -0.59
## 2       1.8           -0.56
## 3       1.8           -0.49
## 4       1.8           -0.61
## 5       1.9           -0.53
## 6       1.9           -0.52
## # ... hidden reserved variables {'.chain', '.iteration', '.draw'}

Using random variables (`rvars`) for uncertainty propagation

The posterior package also provides the rvar class for working with posterior samples as random variables. This approach allows you to perform mathematical operations directly with posterior distributions while propagating uncertainty.

library(posterior)
# Convert model parameters to rvars
rv <- as_draws_rvars(fit_location)

# Look at the intercept in rvar form
rv$b_Intercept

## rvar<500,2>[1] mean ± sd:
## [1] 1.6 ± 0.063

# Calculate the mean delay (on natural scale) for each location
mean_delay_loc0 <- exp(rv$b_Intercept + 0.5 * exp(rv$b_sigma_Intercept)^2)
mean_delay_loc1 <- exp(rv$b_Intercept +
  rv$b_location + 0.5 * (exp(rv$b_sigma_Intercept + rv$b_sigma_location))^2)

# Summarize the distributions
summary(mean_delay_loc0)

## # A tibble: 1 × 10
##   variable         mean median    sd   mad    q5   q95  rhat ess_bulk ess_tail
##   <chr>           <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>    <dbl>
## 1 mean_delay_loc0  5.54   5.48 0.449 0.389  4.95  6.32  1.00     661.     685.

summary(mean_delay_loc1)

## # A tibble: 1 × 10
##   variable         mean median    sd   mad    q5   q95  rhat ess_bulk ess_tail
##   <chr>           <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>    <dbl>
## 1 mean_delay_loc1  12.8   11.4  5.45  2.82  8.24  21.4  1.01     208.     197.

# Calculate the difference between locations
delay_diff <- mean_delay_loc1 - mean_delay_loc0
delay_diff

## rvar<500,2>[1] mean ± sd:
## [1] 7.3 ± 5.5

For more details, see the posterior package documentation.

How can I assess whether sampling has converged?

The output of a call to epidist is compatible with typical Stan workflows. We recommend use of the bayesplot package for sampling diagnostic plots. For example, the function bayesplot::mcmc_trace() can be used to produce traceplots for specified parameters.

library(bayesplot)
mcmc_trace(fit, pars = c("Intercept", "Intercept_sigma"))

We also provide a function epidist_diagnostics() which can be used to obtain common diagnostics used to assess the quality of a fitted model.

epidist_diagnostics(fit)

## # A tibble: 1 × 8
##    time samples max_rhat divergent_transitions per_divergent_transitions
##   <dbl>   <dbl>    <dbl>                 <dbl>                     <dbl>
## 1  2.45    1000     1.00                     0                         0
## # ℹ 3 more variables: max_treedepth <dbl>, no_at_max_treedepth <int>,
## #   per_at_max_treedepth <dbl>

I’d like to run a simulation study

We recommend use of the purrr package for running many epidist models, for example as a part of a simulation study. We particularly highlight two functions which might be useful:

purrr::map() (and other similar functions) for iterating over a list of inputs.
purrr::safely() which ensures that the function called “always succeeds”. In other words, if there is an error it will be captured and output, rather than ending computation (and potentially disrupting a call to purrr::map()).

For an example use of these functions, have a look at the epidist-paper repository containing the code for Park et al. (2024). (Note that in that codebase, we use map as a part of a targets pipeline.)

How are the default priors for `epidist` chosen?

brms provides default priors for all parameters. However, some of those priors do not make sense in the context of our application. Instead, we used prior predictive checking to set epidist-specific default priors which produce epidemiological delay distribution mean and standard deviation parameters in a reasonable range.

For example, for the brms::lognormal() latent individual model, we suggest the following prior distributions for the brms mu and sigma intercept parameters:

# Note that we export lognormal() as part of epidist hence no need for brms::
family <- lognormal()

epidist_family <- epidist_family(data, family)
epidist_formula <- epidist_formula(
  data,
  family = epidist_family,
  formula = mu ~ 1
)

# NULL here means no replacing priors from the user!
epidist_prior <- epidist_prior(
  data = data,
  family = family,
  formula = epidist_formula,
  prior = NULL
)

epidist_prior

##                 prior     class coef group resp  dpar nlpar lb ub  source
##    student_t(3, 5, 3) Intercept                                   default
##  student_t(3, 0, 2.5) Intercept                 sigma             default

(Note that the functions epidist_family() and epidist_prior() are mostly for internal use!)

Here are the distributions on the delay distribution mean and standard deviation parameters that these prior distributions imply:

set.seed(1)
fit_ppc <- epidist(
  data = data,
  formula = mu ~ 1,
  family = lognormal(),
  sample_prior = "only",
  seed = 1,
  backend = "cmdstanr"
)

pred <- predict_delay_parameters(fit_ppc)

pred |>
  as.data.frame() |>
  pivot_longer(
    cols = c("mu", "sigma", "mean", "sd"),
    names_to = "parameter",
    values_to = "value"
  ) |>
  filter(parameter %in% c("mean", "sd")) |>
  ggplot(aes(x = value, y = after_stat(density))) +
  geom_histogram() +
  facet_wrap(. ~ parameter, scales = "free") +
  labs(x = "", y = "Density") +
  theme_minimal() +
  scale_x_log10(labels = comma)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

quantile(pred$mean, c(0.01, 0.1, 0.25, 0.5, 0.75, 0.9, 0.99))

##         1%        10%        25%        50%        75%        90%        99% 
##  0.3172760  0.8667007  1.6163280  3.2461963  6.6291796 12.0852881 34.0415909

quantile(pred$sd, c(0.01, 0.1, 0.25, 0.5, 0.75, 0.9, 0.99))

##         1%        10%        25%        50%        75%        90%        99% 
##  0.1335505  0.3950603  0.7687356  1.7107449  3.9206478  8.1671089 33.4421590

How can I assess how sensitive the fitted posterior distribution is to the prior distribution used?

We recommend use of the priorsense package (Kallioinen et al. 2024) to check how sensitive the posterior distribution is to perturbations of the prior distribution and likelihood using power-scaling analysis:

library(priorsense)
powerscale_plot_dens(fit, variable = c("Intercept", "Intercept_sigma")) +
  theme_minimal()

What do the parameters in my model output correspond to?

The epidist package uses brms to fit models. This means that the model output will include brms-style names for parameters. Here, we provide a table giving the correspondence between the distributional parameter names used in brms and those used in standard R functions for some common likelihood families.

Family	`brms` parameter	R parameter
`lognormal()`	`mu`	`meanlog`
`lognormal()`	`sigma`	`sdlog`
`Gamma()`	`mu`	`shape / scale`
`Gamma()`	`shape`	`shape`
`weibull()`	`mu`	`scale * gamma(1 + 1 / shape)`
`weibull()`	`shape`	`shape`

Note that all families in brms are parameterised with some measure of centrality mu as their first parameter. This parameter does not necessarily correspond to the mean: hence the provision of a function add_mean_sd() within epidist to add columns containing the natural scale mean and standard deviation to a data.frame of draws.

How can I generate predictions with my fitted `epidist` model?

It is possible to generate predictions manually by working with samples from the model output. However this is tricky to do, and so where possible we recommend using the tidybayes package. In particular, following functions may be useful:

tidybayes::add_epred_draws() for predictions of the expected value of a delay.
tidybayes::add_linpred_draws() for predictions of the delay distributional parameter linear predictors.
tidybayes::add_predicted_draws() for predictions of the observed delay.

To see these functions demonstrated in a vignette, see “Advanced features with Ebola data”. As a short example, to generate 4000 predictions (equal to the number of draws) of the delay that would be observed with a double censored observation process (in which the primary and secondary censoring windows are both one) then:

library(tidybayes)

## 
## Attaching package: 'tidybayes'

## The following objects are masked from 'package:brms':
## 
##     dstudent_t, pstudent_t, qstudent_t, rstudent_t

draws_pmf <- tibble::tibble(
  relative_obs_time = Inf, pwindow = 1, swindow = 1, delay_upr = NA
) |>
  add_predicted_draws(fit)

ggplot(draws_pmf, aes(x = .prediction)) +
  geom_bar(aes(y = after_stat(count / sum(count)))) +
  labs(x = "Delay", y = "PMF") +
  scale_x_continuous(limits = c(0, 30)) +
  theme_minimal()

## Warning: Removed 6 rows containing non-finite outside the scale range
## (`stat_count()`).

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_bar()`).

Importantly, this functionality is only available for epidist models using brms families that have a log_lik_censor method implemented internally in brms. If you are using another family, consider submitting a pull request to implement these methods!

How can I use the `marginaleffects` package with my fitted `epidist` model?

The marginaleffects package provides tools for computing and visualising marginal effects, contrasts, and predictions from regression models. It works with epidist models because they are built on top of brms.

For epidist models with covariates, you can use marginaleffects to: - Compute average marginal effects - Make comparisons between different covariate values - Visualise model predictions across the range of covariates

Here’s a simple example using a model that includes location as a covariate:

library(marginaleffects)

# Compare outcomes between location categories (0 and 1)
avg_comparisons(
  fit_location,
  variables = list(location = c(0, 1))
)

## 
##  Estimate 2.5 % 97.5 %
##      5.86  2.21   20.2
## 
## Term: location
## Type:  response 
## Comparison: 1 - 0

For more details on the available functions and their use, see the marginaleffects documentation.

How can I use the `cmdstanr` backend?

The cmdstanr backend is typically more performant than the default rstan backend. To use the cmdstanr backend, we first need to install CmdStan (see the README for more details). We can check we have everything we need as follows:

cmdstanr::cmdstan_version()

## [1] "2.36.0"

References

Kallioinen, Noa, Topi Paananen, Paul-Christian Bürkner, and Aki Vehtari. 2024. “Detecting and Diagnosing Prior and Likelihood Sensitivity with Power-Scaling.” Statistics and Computing 34 (1): 57.

Park, Sang Woo, Andrei R. Akhmetzhanov, Kelly Charniga, Anne Cori, Nicholas G. Davies, Jonathan Dushoff, Sebastian Funk, et al. 2024. “Estimating Epidemiological Delay Distributions for Infectious Diseases.” medRxiv. https://doi.org/10.1101/2024.01.12.24301247.

Working with posterior samples

Getting a data.frame of posterior samples

Using random variables (rvars) for uncertainty propagation