Bayesian Statistics


Lecture 08

February 19, 2024

Review

Last Class(es)

  • Probability models for dynamical systems/simulation models
  • Generative model: can include discrepancy and/or observational errors
    • Model the data itself, not just its expected value (as in regression)
  • Maximize likelihood over model and statistical parameters.

Non-Uniqueness of MLE

  • Many models do not have well-defined maximum likelihoods.

Non-Identifiability

\[\underbrace{h_t}_{\substack{\text{hare} \\ \text{pelts}}} \sim \text{LogNormal}(\log(\underbrace{p_H}_{\substack{\text{trap} \\ \text{rate}}} H_t), \sigma_H)\] \[l_t \sim \text{LogNormal}(\log(p_L L_t), \sigma_L)\]

\[ \begin{align*} \frac{dH}{dt} &= H_t b_H - H_t (L_t m_H) \\ H_T &= H_1 + \int_1^T \frac{dH}{dt}dt \end{align*} \]

\[ \begin{align*} \frac{dL}{dt} &= L_t (H_t b_L) - L_t m_L \\ L_T &= L_1 + \int_1^T \frac{dL}{dt}dt \end{align*} \]
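The equations above fully specify a generative process. As a minimal sketch (all parameter values below are hypothetical placeholders chosen only to make the code run, not fitted values), we can integrate the dynamics with forward Euler and then draw pelt observations from the LogNormal measurement model:

using Distributions, Random

# Minimal simulation sketch: forward-Euler integration of the predator-prey
# dynamics, then LogNormal pelt observations. All parameter values are
# hypothetical placeholders.
function simulate_pelts(; b_H=0.55, m_H=0.028, b_L=0.026, m_L=0.84,
                        p_H=0.06, p_L=0.05, σ_H=0.25, σ_L=0.25,
                        H=30.0, L=4.0, dt=0.01, T=20)
    hare_pelts, lynx_pelts = Float64[], Float64[]
    for _ in 1:T
        for _ in 1:round(Int, 1 / dt)   # Euler steps over one time unit
            dH = H * b_H - H * (L * m_H)
            dL = L * (H * b_L) - L * m_L
            H += dt * dH
            L += dt * dL
        end
        push!(hare_pelts, rand(LogNormal(log(p_H * H), σ_H)))
        push!(lynx_pelts, rand(LogNormal(log(p_L * L), σ_L)))
    end
    return hare_pelts, lynx_pelts
end

Random.seed!(1)
hare, lynx = simulate_pelts()

Note that rescaling \(H_1 \to cH_1\), \(b_L \to b_L / c\), and \(p_H \to p_H / c\) leaves the observation distributions unchanged: this is one example of the likelihood “ridges” discussed on the next slide.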

Non-Uniqueness of MLE

  • Many models do not have well-defined maximum likelihoods.
  • Can be due to multi-modality or “ridges”.
  • Sometimes also referred to as equifinality.
  • Poses problems for MLE.

Bayesian Statistics

Prior Information

So far: no way to use prior information about parameters (other than bounds on MLE optimization).

For example: what “trap rates” are more plausible?

Bayes’ Rule

Original version (Bayes, 1763):

\[P(A | B) = \frac{P(B | A) \times P(A)}{P(B)} \quad \text{if} \quad P(B) \neq 0.\]

Bayes’ Rule

“Modern” version (Laplace, 1774):

\[\underbrace{{p(\theta | y)}}_{\text{posterior}} = \frac{\overbrace{p(y | \theta)}^{\text{likelihood}}}{\underbrace{p(y)}_\text{normalization}} \overbrace{p(\theta)}^\text{prior}\]

Bayes’ Rule (Ignoring Normalizing Constants)

The version of Bayes’ rule that matters for (approximately) 95% of Bayesian statistics:

\[p(\theta | y) \propto p(y | \theta) \times p(\theta)\]

“The posterior is the prior times the likelihood…”

Credible Intervals

Bayesian credible intervals are straightforward to interpret: \(\theta\) is in \(I\) with probability \(\alpha\).

Choose \(I\) such that \[p(\theta \in I | \mathbf{y}) = \alpha.\]
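For example, when the posterior has a closed form, an equal-tailed credible interval can be read directly off its quantiles. A minimal sketch, assuming a hypothetical \(\text{Beta}(13, 6)\) posterior:

using Distributions

posterior = Beta(13, 6)   # hypothetical closed-form posterior p(θ | y)
lo, hi = quantile(posterior, 0.025), quantile(posterior, 0.975)
# θ ∈ [lo, hi] with posterior probability α = 0.95 (equal-tailed interval)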

Bayesian Model Components

A fully specified Bayesian model includes:

  1. Prior distributions over the parameters, \(p(\theta)\)
  2. Probability model for the data given the parameters (the likelihood), \(p(y | \theta)\)

Think: the prior proposes explanations; the likelihood re-weights them by their ability to produce the observed data.

Generative Modeling

Bayesian models lend themselves to generative simulation: new data \(\tilde{y}\) can be generated through the posterior predictive distribution:

\[p(\tilde{y} | \mathbf{y}) = \int_{\Theta} p(\tilde{y} | \theta) p(\theta | \mathbf{y}) d\theta\]
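In practice this integral is usually approximated by Monte Carlo: draw \(\theta\) from the posterior, then draw \(\tilde{y}\) from the data model. A minimal sketch, assuming a Beta posterior and Bernoulli data model for concreteness:

using Distributions

posterior = Beta(13, 6)   # assumed posterior over θ
# posterior predictive draws: θ from the posterior, then ỹ | θ from the likelihood
y_tilde = [rand(Bernoulli(rand(posterior))) for _ in 1:10_000]
mean(y_tilde)             # Monte Carlo estimate of the predictive heads probability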

How To Choose A Prior?

One perspective: Priors should reflect “actual knowledge” independent of the analysis (Jaynes, 2003)

Another: Priors are part of the probability model, and can be specified/changed accordingly based on predictive skill (Gelman et al., 2017; Gelman & Shalizi, 2013)

What Makes A Good Prior?

  • Reflects level of understanding (informative vs. weakly informative vs. non-informative).
  • Does not zero out probability of plausible values.
  • Regularizes: extreme values should be less probable.

What Makes A Bad Prior?

  • Assigns probability zero to plausible values.
  • Weights implausible values as heavily as more plausible ones.
  • Double-counts information (e.g., fitting a prior to data that is also used in the likelihood).
  • Chosen based on vibes.
  • Personal opinion: uniform distributions.

A Coin Flipping Example

We would like to understand if a coin-flipping game is fair. We’ve observed the following sequence of flips:

flips = ["H", "H", "H", "T", "H", "H", "H", "H", "H"]
9-element Vector{String}:
 "H"
 "H"
 "H"
 "T"
 "H"
 "H"
 "H"
 "H"
 "H"

Coin Flipping Likelihood

The data-generating process here is straightforward: we can represent a coin flip with a heads-probability of \(\theta\) as a sample from a Bernoulli distribution,

\[y_i \sim \text{Bernoulli}(\theta).\]

using Distributions, Optim
flip_ll(θ) = sum(logpdf.(Bernoulli(θ), flips .== "H"))
θ_mle = Optim.optimize(θ -> -flip_ll(θ), 0, 1).minimizer
round(θ_mle, digits=2)
0.89
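This matches the closed-form Bernoulli MLE, \(\hat{\theta} = k/n = 8/9 \approx 0.89\), for \(k = 8\) heads in \(n = 9\) flips.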

Coin Flipping Prior

Suppose we speak to a friend who knows something about coins, and she tells us that it is extremely difficult to make a passable weighted coin that comes up heads more than 75% of the time.

Coin Flipping Prior

Since \(\theta\) is bounded between 0 and 1, we’ll use a Beta distribution for our prior, specifically \(\text{Beta}(5,5)\).

Code
using StatsPlots, LaTeXStrings   # Distribution plot recipe and L"..." labels
prior_dist = Beta(5, 5)
plot(prior_dist; label=false, xlabel=L"$θ$", ylabel=L"$p(θ)$", linewidth=3, tickfontsize=16, guidefontsize=18)
plot!(size=(500, 500))
Figure 1: Beta prior for coin flipping example

Maximum A Posteriori Estimate

Combining the likelihood and prior using Bayes’ rule lets us calculate the maximum a posteriori (MAP) estimate:

flip_ll(θ) = sum(logpdf.(Bernoulli(θ), flips .== "H"))
flip_lprior(θ) = logpdf(Beta(5, 5), θ)
flip_lposterior(θ) = flip_ll(θ) + flip_lprior(θ)
θ_map = Optim.optimize(θ -> -flip_lposterior(θ), 0, 1).minimizer
round(θ_map, digits=2)
0.71
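As a check: the Beta prior is conjugate to the Bernoulli likelihood, so the posterior is available in closed form as \(\text{Beta}(5 + 8, 5 + 1) = \text{Beta}(13, 6)\), whose mode agrees with the numerical MAP estimate:

mode(Beta(13, 6))   # (13 - 1) / (13 + 6 - 2) = 12/17 ≈ 0.71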

Coin Flipping Posterior Distribution

Code
θ_range = 0:0.01:1
plot(θ_range, flip_lposterior.(θ_range), color=:black, label="Posterior", linewidth=3)
plot!(θ_range, flip_ll.(θ_range), color=:black, label="Likelihood", linewidth=3, linestyle=:dash)
plot!(θ_range, flip_lprior.(θ_range), color=:black, label="Prior", linewidth=3, linestyle=:dot)
vline!([θ_map], color=:red, label="MAP", linewidth=2)
vline!([θ_mle], color=:blue, label="MLE", linewidth=2)
xlabel!(L"$\theta$")
ylabel!("Log-Density")
plot!(size=(1000, 450))
Figure 2: Posterior distribution for the coin-flipping example

Bayes and Parametric Uncertainty

Frequentist: Parametric uncertainty is purely the result of sampling variability

Bayesian: Parameters have probabilities based on consistency with data and priors.

Think: how “likely” is a set of parameters to have produced the data given the specified data generating process?

Bayesian Updating

  • The posterior is a “compromise” between the prior and the data.
  • The posterior mean is a weighted combination of the data and the prior mean.
  • The weights depend on the prior and the likelihood variances.
  • More data usually makes the posterior more confident.
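For the conjugate Beta-Bernoulli model from the coin example, this compromise is explicit: with a \(\text{Beta}(\alpha, \beta)\) prior and \(k\) heads in \(n\) flips,

\[\mathbb{E}[\theta | y] = \frac{\alpha + k}{\alpha + \beta + n} = \underbrace{\frac{\alpha + \beta}{\alpha + \beta + n}}_{\text{prior weight}} \frac{\alpha}{\alpha + \beta} + \underbrace{\frac{n}{\alpha + \beta + n}}_{\text{data weight}} \frac{k}{n},\]

so the weight on the data grows toward one as \(n\) increases.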

Key Points

Key Points

  • Bayesian probability: parameters have probabilities conditional on data.
  • Need to specify a prior distribution (think generatively!).
  • The posterior distribution reflects a compromise between the prior and the likelihood.
  • The maximum a posteriori estimate gives the “most probable” parameter values.

Key Points: Priors

  • Use prior predictive simulations to refine priors (see the sketch after this list).
  • Be transparent and principled about prior choices (sensitivity analyses?).
  • Don’t choose priors based on convenience.
  • Will talk more about general sampling later.
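A minimal prior predictive sketch for the coin example: draw \(\theta\) from the \(\text{Beta}(5, 5)\) prior, simulate 9 flips, and check whether the simulated head counts look plausible before involving the data.

using Distributions

prior = Beta(5, 5)
# head counts for 9 flips under parameters drawn from the prior
n_heads = [rand(Binomial(9, rand(prior))) for _ in 1:10_000]
# if implausible counts dominate, revise the prior before fitting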

Upcoming Schedule

Next Classes

Next Week: Sampling! Specifically, Monte Carlo.

Assessments

Homework 2 due Friday (2/21).

Quiz: Due Monday (all on today’s lecture).

Project: Will discuss Monday, start thinking about possible topics.

References

References

Bayes, T. (1763). An Essay towards Solving a Problem in the Doctrine of Chances. Philosophical Transactions of the Royal Society of London, 53, 370–418.
Gelman, A., & Shalizi, C. R. (2013). Philosophy and the practice of Bayesian statistics. British Journal of Mathematical and Statistical Psychology, 66, 8–38. https://doi.org/10.1111/j.2044-8317.2011.02037.x
Gelman, A., Simpson, D., & Betancourt, M. (2017). The prior can often only be understood in the context of the likelihood. Entropy, 19, 555. https://doi.org/10.3390/e19100555
Jaynes, E. T. (2003). Probability Theory: The Logic of Science (G. L. Bretthorst, Ed.). Cambridge, UK; New York, NY: Cambridge University Press.
Laplace, P. S. (1774). Mémoire sur la Probabilité des Causes par les évènemens. In Mémoires de Mathematique et de Physique, Presentés à l’Académie Royale des Sciences, Par Divers Savans & Lus Dans ses Assemblées (pp. 621–656).