# Approximate Bayesian inference in epidist

Source: `vignettes/approx-inference.Rmd`

## 1 Background

The `epidist` package uses Bayesian inference to estimate delay distributions and other quantities.
Doing Bayesian inference amounts to approximating the posterior distribution of each parameter in the statistical model^{1}.
A range of methods exist to perform this approximation.

By default, `epidist` uses the No-U-Turn Sampler [NUTS; Hoffman and Gelman (2014)], a Hamiltonian Monte Carlo (HMC) algorithm.
NUTS is an example of a broader class of Markov chain Monte Carlo (MCMC) methods.
These methods work by simulating from a Markov chain with the target posterior distribution as its stationary distribution.
When MCMC algorithms are run for sufficiently many iterations, and have reached convergence, the samples can be treated as being drawn from the posterior distribution.
Relevant posterior quantities such as expectations may then be computed using these samples.
A drawback of MCMC methods such as NUTS is that the simulations can be computationally intensive, especially for complex models or large datasets.
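To make the idea of simulating a Markov chain concrete, here is a minimal base R sketch. It uses random-walk Metropolis, a much simpler MCMC method than NUTS, and a stand-in target (a normal density with mean 2) rather than an `epidist` posterior. The names `log_target` and `draws_rw` are illustrative only.

```r
set.seed(1)

# Stand-in log "posterior" density: Normal(mean = 2, sd = 1).
# This is an assumption for illustration, not an epidist model.
log_target <- function(theta) dnorm(theta, mean = 2, sd = 1, log = TRUE)

n_iter <- 5000
theta <- numeric(n_iter)
theta[1] <- 0 # arbitrary starting value

for (i in 2:n_iter) {
  # Propose a move by perturbing the current state
  proposal <- theta[i - 1] + rnorm(1, sd = 1)
  # Accept with probability min(1, target(proposal) / target(current))
  log_ratio <- log_target(proposal) - log_target(theta[i - 1])
  if (log(runif(1)) < log_ratio) {
    theta[i] <- proposal
  } else {
    theta[i] <- theta[i - 1]
  }
}

# Discard early iterations as warm-up, then estimate an expectation
draws_rw <- theta[-(1:1000)]
mean(draws_rw)
```

The sample mean of the retained draws estimates the target's expectation. NUTS pursues the same goal but uses gradient information to propose moves far more efficiently.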

The `epidist` package is built using `brms` (Bürkner 2017), which stands for “Bayesian Regression Models using Stan”, where Stan (Carpenter et al. 2017) is a probabilistic programming language.
Although NUTS is the primary inference algorithm used in Stan, additional options are available.
These additional inference algorithms are also available in `epidist` due to its dependence on `brms`.

In this vignette, we first briefly describe the alternative algorithms available (Section 2) and direct you to more detailed resources. Then (Section 3) we demonstrate their application to fitting simulated data, before extracting and comparing posterior distributions. By comparing the resulting inferences to those from NUTS, we hope to help you make informed decisions about which algorithm to use in your applied problem.

## 2 Alternative approximate inference algorithms

Here we describe three alternative approximate Bayesian inference algorithms that are available in `brms`, and therefore also in `epidist`.
Further inference algorithms may have become available since this vignette was last updated.
Please check `brms` package updates if interested!

### 2.1 Laplace method

The Laplace method approximates a posterior distribution by a Gaussian distribution centered (by default) at the posterior mode.
In Stan, the Gaussian approximation is constructed on the unconstrained parameter space (as the domain of a Gaussian distribution is the real line).
Samples from the Gaussian approximation may then be transformed to the constrained parameter space.
To access the Laplace method, specify `algorithm = "laplace"` within `brms::brm()`.
See the section Laplace Sampling of the `CmdStan` User’s Guide for more information.
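The steps above can be sketched in base R for a toy one-parameter model. This is not how Stan or `brms` implement the method. It assumes simulated normal data with known mean and an unknown standard deviation `sigma`, an illustrative half-normal prior, and the transformation `eta = log(sigma)` to the unconstrained scale. All names here are hypothetical.

```r
set.seed(1)
y <- rnorm(50, mean = 0, sd = 2) # simulated data for illustration

# Negative log posterior on the unconstrained scale eta = log(sigma).
# The "+ eta" term is the log Jacobian of the transformation sigma = exp(eta).
neg_log_post <- function(eta) {
  sigma <- exp(eta)
  log_lik <- sum(dnorm(y, mean = 0, sd = sigma, log = TRUE))
  log_prior <- dnorm(sigma, mean = 0, sd = 5, log = TRUE) # assumed prior
  -(log_lik + log_prior + eta)
}

# Find the posterior mode and the curvature (Hessian) at the mode
opt <- optim(
  par = log(sd(y)), fn = neg_log_post, method = "BFGS", hessian = TRUE
)

# Gaussian approximation on the unconstrained scale:
# mean = mode, variance = inverse of the Hessian of the negative log posterior
eta_draws <- rnorm(4000, mean = opt$par, sd = sqrt(1 / opt$hessian[1, 1]))

# Transform draws back to the constrained (positive) scale
sigma_draws <- exp(eta_draws)
mean(sigma_draws)
```

Because the Gaussian is fitted on the unconstrained scale, every transformed draw of `sigma` is positive, even though a Gaussian placed directly on `sigma` could put mass below zero.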

### 2.2 Variational inference using ADVI

Automatic differentiation variational inference [ADVI; Kucukelbir et al. (2017)] is a type of variational inference [VI; Blei, Kucukelbir, and McAuliffe (2017)] algorithm.
VI works by restricting to a family of distributions, and then selecting the member of that family which is the most similar to the posterior distribution.
Most commonly, and in Stan, (dis-)similarity is measured using the Kullback–Leibler (KL) divergence.
There are two options for the family of distributions: either a fully factorised Gaussian with `algorithm = "meanfield"`, or a Gaussian with a full-rank covariance matrix with `algorithm = "fullrank"`.
See the section Variational Inference using ADVI of the `CmdStan` User’s Guide for more information.
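As a sketch of the objective described above, write $p(\theta \mid y)$ for the posterior of the (unconstrained) parameters $\theta$ given data $y$, and $q_\phi(\theta)$ for a member of the approximating family with variational parameters $\phi$. This notation is introduced here for illustration and is not used elsewhere in the vignette.

$$
\begin{aligned}
\mathrm{KL}\left(q_\phi \,\|\, p(\cdot \mid y)\right)
  &= \mathbb{E}_{q_\phi}\left[\log q_\phi(\theta) - \log p(\theta \mid y)\right] \\
  &= \log p(y) - \underbrace{\left(\mathbb{E}_{q_\phi}\left[\log p(y, \theta)\right] - \mathbb{E}_{q_\phi}\left[\log q_\phi(\theta)\right]\right)}_{\mathrm{ELBO}(\phi)}
\end{aligned}
$$

Since $\log p(y)$ does not depend on $\phi$, minimising the KL divergence is equivalent to maximising the evidence lower bound (ELBO), which only requires the joint density $p(y, \theta)$. The two families differ in the covariance they allow:

$$
q^{\text{meanfield}}_\phi(\theta) = \prod_{i} \mathcal{N}\left(\theta_i \mid \mu_i, \sigma_i^2\right),
\qquad
q^{\text{fullrank}}_\phi(\theta) = \mathcal{N}\left(\theta \mid \mu, \Sigma\right).
$$

The mean-field family cannot represent posterior correlations between parameters, whereas the full-rank family can, at the cost of many more variational parameters.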

### 2.3 Pathfinder

Pathfinder is a method closely related to variational inference, which has been more recently developed by Zhang et al. (2022).
It works by generating Gaussian approximations along each step of an iterative optimisation algorithm (such as L-BFGS).
The KL divergence from each approximation to the posterior is measured, with the best approximation chosen.
Pareto smoothed importance sampling [PSIS; Vehtari et al. (2015)] is optionally used to resample draws from the chosen Gaussian distribution.
When multiple paths are specified (using `num_paths`), the Pathfinder algorithm is run multiple times, initialising the optimisation at different points.
The resulting approximation is a mixture of Gaussian distributions, rather than a single Gaussian distribution.
See the section Pathfinder Method for Approximate Bayesian Inference of the `CmdStan` User’s Guide for more information.

## 3 Demonstration

In this demonstration, we use the following packages:

```
library(epidist)
library(ggplot2)
library(dplyr)
library(gt)
library(purrr)
library(tidyr)
library(tibble)
library(tidybayes)
```

First, we begin by simulating data. The example data simulation process follows that used in the Getting started with epidist vignette, so we will not detail exactly what is happening here, but please consult that vignette if interested:

```
meanlog <- 1.8
sdlog <- 0.5
obs_time <- 25
sample_size <- 200
obs_cens_trunc <- simulate_gillespie(seed = 101) |>
  simulate_secondary(
    meanlog = meanlog,
    sdlog = sdlog
  ) |>
  observe_process() |>
  filter_obs_by_obs_time(obs_time = obs_time)

obs_cens_trunc_samp <-
  obs_cens_trunc[sample(seq_len(.N), sample_size, replace = FALSE)]
```

We now prepare the data for fitting with the latent individual model, and perform inference with HMC:

```
data <- as_latent_individual(obs_cens_trunc_samp)
t <- proc.time()
fit_hmc <- epidist(data = data, algorithm = "sampling")
time_hmc <- proc.time() - t
```

Note that for clarity above we specify `algorithm = "sampling"`, but if you were to call `epidist(data = data)` the result would be the same, since `"sampling"` (i.e. HMC) is the default value for the `algorithm` argument.

Now, we fit^{2} the same latent individual model using each method in Section 2.
To match the four Markov chains of length 1000 in HMC above, we then draw 4000 samples from each approximate posterior.

```
t <- proc.time()
fit_laplace <- epidist(data = data, algorithm = "laplace", draws = 4000)
time_laplace <- proc.time() - t
t <- proc.time()
fit_advi <- epidist(data = data, algorithm = "meanfield", draws = 4000)
time_advi <- proc.time() - t
```

For the Pathfinder algorithm we set `num_paths = 1`.
In testing, both the Laplace and ADVI methods ran without problems in all cases, but Pathfinder often produced the error “Error evaluating model log probability: Non-finite gradient.”
A `save_single_paths` option is available, which might have allowed individual Pathfinder paths to be recovered (and faulty paths removed), but it does not appear to be working currently^{3}.

```
t <- proc.time()
fit_pathfinder <- epidist(
  data = data, algorithm = "pathfinder", draws = 4000, num_paths = 1
)
#> Warning: Number of PSIS draws is larger than the total number of draws returned by the single Pathfinders. This is likely unintentional and leads to re-sampling from the same draws.
#> Path [1] :Initial log joint density = -1520.740908
#> Error evaluating model log probability: Non-finite gradient.
#> Path [1] : Iter log prob ||dx|| ||grad|| alpha alpha0 # evals ELBO Best ELBO Notes
#> 66 -9.381e+02 4.145e-04 3.228e-03 1.925e-01 1.000e+00 1651 -4.023e+02 -4.023e+02
#> Path [1] :Best Iter: [66] ELBO (-402.301946) evaluations: (1651)
#> Finished in 0.4 seconds.
time_pathfinder <- proc.time() - t
```

We now extract the posterior distributions of the delay parameters from the fitted model for each inference method. Thankfully, each algorithm is implemented to return draws from the (approximate) posterior distribution, and so post-processing is simple.

```
fits <- list(
  "HMC" = fit_hmc,
  "Laplace" = fit_laplace,
  "ADVI" = fit_advi,
  "Pathfinder" = fit_pathfinder
)

draws <- imap(fits, function(fit, name) {
  predict_delay_parameters(fit) |>
    as.data.frame() |>
    pivot_longer(
      cols = c("mu", "sigma", "mean", "sd"),
      names_to = "parameter",
      values_to = "value"
    ) |>
    filter(parameter %in% c("mu", "sigma")) |>
    mutate(method = as.factor(name))
})

draws <- bind_rows(draws)
```

### 3.1 Comparison of parameter posterior distributions

The posterior mean of each parameter under each method is as follows.

```
pars <- draws |>
  group_by(method, parameter) |>
  summarise(value = mean(value)) |>
  pivot_wider(names_from = parameter, values_from = value)

pars |>
  ungroup() |>
  gt()
```

| method | mu | sigma |
|---|---|---|
| HMC | 1.747895 | 0.4622867 |
| Laplace | 1.730603 | 0.4267119 |
| ADVI | 1.799455 | 0.4442930 |
| Pathfinder | 1.729763 | 0.4282183 |

More comprehensively, the estimated posterior distributions are shown in Figure 3.1.

```
draws |>
  ggplot(aes(x = value, col = method)) +
  stat_slabinterval(density = "histogram", breaks = 30, alpha = 0.8) +
  scale_colour_manual(values = c("#56B4E9", "#009E73", "#E69F00", "#CC79A7")) +
  facet_grid(method ~ parameter, scales = "free_x") +
  theme_minimal() +
  guides(fill = "none") +
  labs(x = "", y = "", col = "Method") +
  theme(legend.position = "bottom")
```

### 3.2 Comparison of resulting delay distributions

Figure 3.2 shows how the different `mu` and `sigma` posterior mean estimates from each inference method alter the estimated delay distribution.

```
pmap_df(
  filter(pars), ~ tibble(
    x = seq(0, 25, by = 0.1),
    method = ..1, density = dlnorm(x, ..2, ..3)
  )
) |>
  ggplot(aes(x = x, y = density, col = method)) +
  geom_line() +
  scale_color_manual(values = c("#56B4E9", "#009E73", "#E69F00", "#CC79A7")) +
  theme_minimal() +
  labs(x = "", y = "", col = "Method")
```
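To put these differences on the natural scale, the following sketch converts each pair of `mu` and `sigma` posterior means into the mean and standard deviation of the corresponding lognormal delay distribution, using the standard lognormal identities. The values are copied by hand from the table in Section 3.1, and `pars_tab` is an illustrative name. Note that plugging in posterior means of `mu` and `sigma` is not the same as taking the posterior mean of the delay mean itself.

```r
# Posterior means of mu and sigma, copied from the table in Section 3.1
pars_tab <- data.frame(
  method = c("HMC", "Laplace", "ADVI", "Pathfinder"),
  mu = c(1.747895, 1.730603, 1.799455, 1.729763),
  sigma = c(0.4622867, 0.4267119, 0.4442930, 0.4282183)
)

# Lognormal identities:
#   mean = exp(mu + sigma^2 / 2)
#   sd   = mean * sqrt(exp(sigma^2) - 1)
pars_tab$mean <- exp(pars_tab$mu + pars_tab$sigma^2 / 2)
pars_tab$sd <- pars_tab$mean * sqrt(exp(pars_tab$sigma^2) - 1)

pars_tab
```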

### 3.3 Comparison of time taken

In this example, HMC took roughly 13 to 25 times longer than the other methods, with Pathfinder being the fastest. That said, even for HMC the computation time in this case (under 30 seconds) is unlikely to be prohibitive.

```
times <- list(
  "HMC" = time_hmc,
  "Laplace" = time_laplace,
  "ADVI" = time_advi,
  "Pathfinder" = time_pathfinder
)

times |>
  map_dbl("elapsed") |>
  enframe(name = "method", value = "time (s)") |>
  gt()
```

| method | time (s) |
|---|---|
| HMC | 28.644 |
| Laplace | 2.196 |
| ADVI | 1.773 |
| Pathfinder | 1.166 |

## 4 Conclusion

The range of alternative approximation algorithms available, and their ease of use, is an attractive feature of `brms`.
We found that these algorithms do produce reasonable approximations in far less time than HMC.
Of course, this vignette only includes one example, and a more thorough investigation would be required to make specific recommendations.
That said, we do not currently recommend using the Pathfinder algorithm, due to its unstable performance in our testing and its early-stage software implementation.

### Bibliography

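Blei, Kucukelbir, and McAuliffe (2017). “Variational Inference: A Review for Statisticians.” *Journal of the American Statistical Association* 112 (518): 859–77.

Bürkner (2017). “brms: An R Package for Bayesian Multilevel Models Using Stan.” *Journal of Statistical Software* 80 (1): 1–28. https://doi.org/10.18637/jss.v080.i01.

Carpenter et al. (2017). “Stan: A Probabilistic Programming Language.” *Journal of Statistical Software* 76.

Hoffman and Gelman (2014). “The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo.” *J. Mach. Learn. Res.* 15 (1): 1593–623.

Kucukelbir et al. (2017). “Automatic Differentiation Variational Inference.” *Journal of Machine Learning Research* 18 (14): 1–45. http://jmlr.org/papers/v18/16-107.html.

Vehtari et al. (2015). “Pareto Smoothed Importance Sampling.” *arXiv Preprint arXiv:1507.02646*.

Zhang et al. (2022). “Pathfinder: Parallel Quasi-Newton Variational Inference.” *Journal of Machine Learning Research* 23 (306): 1–49. http://jmlr.org/papers/v23/21-0889.html.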