Bayesian Statistics


Lecture 08

February 19, 2024

Review

Last Class(es)

  • Probability models for dynamical systems/simulation models
  • Generative model: can include discrepancy and/or observational errors
    • Model the data itself, not just its expected value (as in regression)
  • Maximize likelihood over model and statistical parameters.

Non-Uniqueness of MLE

  • Many models do not have well-defined maximum likelihoods.

Non-Identifiability

\[\underbrace{h_t}_{\substack{\text{hare} \\ \text{pelts}}} \sim \text{LogNormal}(\log(\underbrace{p_H}_{\substack{\text{trap} \\ \text{rate}}} H_t), \sigma_H)\] \[l_t \sim \text{LogNormal}(\log(p_L L_t), \sigma_L)\]

\[ \begin{align*} \frac{dH}{dt} &= H_t b_H - H_t (L_t m_H) \\ H_T &= H_1 + \int_1^T \frac{dH}{dt}dt \end{align*} \]

\[ \begin{align*} \frac{dL}{dt} &= L_t (H_t b_L) - L_t m_L \\ L_T &= L_1 + \int_1^T \frac{dL}{dt}dt \end{align*} \]
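The equations above fully specify a generative process. As a minimal sketch (all parameter values below are hypothetical placeholders chosen only to make the code run, not fitted values), we can integrate the dynamics with forward Euler and then draw pelt observations from the LogNormal measurement model:

using Distributions, Random

# Minimal simulation sketch: forward-Euler integration of the predator-prey
# dynamics, then LogNormal pelt observations. All parameter values are
# hypothetical placeholders.
function simulate_pelts(; b_H=0.55, m_H=0.028, b_L=0.026, m_L=0.84,
                        p_H=0.06, p_L=0.05, σ_H=0.25, σ_L=0.25,
                        H=30.0, L=4.0, dt=0.01, T=20)
    hare_pelts, lynx_pelts = Float64[], Float64[]
    for _ in 1:T
        for _ in 1:round(Int, 1 / dt)   # Euler steps over one time unit
            dH = H * b_H - H * (L * m_H)
            dL = L * (H * b_L) - L * m_L
            H += dt * dH
            L += dt * dL
        end
        push!(hare_pelts, rand(LogNormal(log(p_H * H), σ_H)))
        push!(lynx_pelts, rand(LogNormal(log(p_L * L), σ_L)))
    end
    return hare_pelts, lynx_pelts
end

Random.seed!(1)
hare, lynx = simulate_pelts()

Note that rescaling \(H_1 \to cH_1\), \(b_L \to b_L / c\), and \(p_H \to p_H / c\) leaves the observation distributions unchanged: this is one example of the likelihood “ridges” discussed on the next slide.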

Non-Uniqueness of MLE

  • Many models do not have well-defined maximum likelihoods.
  • Can be due to multi-modality or “ridges”.
  • Sometimes also referred to as equifinality.
  • Poses problems for MLE.

Bayesian Statistics

Prior Information

So far: no way to use prior information about parameters (other than bounds on MLE optimization).

For example: what “trap rates” are more plausible?

Bayes’ Rule

Original version (Bayes, 1763):

\[P(A | B) = \frac{P(B | A) \times P(A)}{P(B)} \quad \text{if} \quad P(B) \neq 0.\]

Bayes’ Rule

“Modern” version (Laplace, 1774):

\[\underbrace{{p(\theta | y)}}_{\text{posterior}} = \frac{\overbrace{p(y | \theta)}^{\text{likelihood}}}{\underbrace{p(y)}_\text{normalization}} \overbrace{p(\theta)}^\text{prior}\]

Bayes’ Rule (Ignoring Normalizing Constants)

The version of Bayes’ rule that matters for (approximately) 95% of Bayesian statistics:

\[p(\theta | y) \propto p(y | \theta) \times p(\theta)\]

“The posterior is the prior times the likelihood…”

Credible Intervals

Bayesian credible intervals are straightforward to interpret: \(\theta\) is in \(I\) with probability \(\alpha\).

Choose \(I\) such that \[p(\theta \in I | \mathbf{y}) = \alpha.\]
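For example, when the posterior has a closed form, an equal-tailed credible interval can be read directly off its quantiles. A minimal sketch, assuming a hypothetical \(\text{Beta}(13, 6)\) posterior:

using Distributions

posterior = Beta(13, 6)   # hypothetical closed-form posterior p(θ | y)
lo, hi = quantile(posterior, 0.025), quantile(posterior, 0.975)
# θ ∈ [lo, hi] with posterior probability α = 0.95 (equal-tailed interval)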

Bayesian Model Components

A fully specified Bayesian model includes:

  1. Prior distributions over the parameters, \(p(\theta)\)
  2. Probability model for the data given the parameters (the likelihood), \(p(y | \theta)\)

Think: the prior proposes explanations; the likelihood re-weights them by their ability to produce the observed data.

Generative Modeling

Bayesian models lend themselves to generative simulation: new data \(\tilde{y}\) can be generated through the posterior predictive distribution:

\[p(\tilde{y} | \mathbf{y}) = \int_{\Theta} p(\tilde{y} | \theta) p(\theta | \mathbf{y}) d\theta\]
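In practice this integral is usually approximated by Monte Carlo: draw \(\theta\) from the posterior, then draw \(\tilde{y}\) from the data model. A minimal sketch, assuming a Beta posterior and Bernoulli data model for concreteness:

using Distributions

posterior = Beta(13, 6)   # assumed posterior over θ
# posterior predictive draws: θ from the posterior, then ỹ | θ from the likelihood
y_tilde = [rand(Bernoulli(rand(posterior))) for _ in 1:10_000]
mean(y_tilde)             # Monte Carlo estimate of the predictive heads probability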

How To Choose A Prior?

One perspective: Priors should reflect “actual knowledge” independent of the analysis (Jaynes, 2003)

Another: Priors are part of the probability model, and can be specified/changed accordingly based on predictive skill (Gelman et al., 2017; Gelman & Shalizi, 2013)

What Makes A Good Prior?

  • Reflects level of understanding (informative vs. weakly informative vs. non-informative).
  • Does not zero out probability of plausible values.
  • Regularizes: extreme values should be less probable.

What Makes A Bad Prior?

  • Assigns probability zero to plausible values.
  • Weights implausible values as heavily as more plausible ones.
  • Double-counts information (e.g., fitting a prior to data that is also used in the likelihood).
  • Chosen based on vibes.
  • Personal opinion: uniform distributions.

A Coin Flipping Example

We would like to understand if a coin-flipping game is fair. We’ve observed the following sequence of flips:

flips = ["H", "H", "H", "T", "H", "H", "H", "H", "H"]
9-element Vector{String}:
 "H"
 "H"
 "H"
 "T"
 "H"
 "H"
 "H"
 "H"
 "H"

Coin Flipping Likelihood

The data-generating process here is straightforward: we can represent a coin flip with a heads-probability of \(\theta\) as a sample from a Bernoulli distribution,

\[y_i \sim \text{Bernoulli}(\theta).\]

using Distributions, Optim
flip_ll(θ) = sum(logpdf.(Bernoulli(θ), flips .== "H"))
θ_mle = Optim.optimize(θ -> -flip_ll(θ), 0, 1).minimizer
round(θ_mle, digits=2)
0.89
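This matches the closed-form Bernoulli MLE, \(\hat{\theta} = k/n = 8/9 \approx 0.89\), for \(k = 8\) heads in \(n = 9\) flips.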

Coin Flipping Prior

Suppose we speak to a friend who knows something about coins, and she tells us that it is extremely difficult to make a passable weighted coin that comes up heads more than 75% of the time.

Coin Flipping Prior

Since \(\theta\) is bounded between 0 and 1, we’ll use a Beta distribution for our prior, specifically \(\text{Beta}(5,5)\).

Code
using StatsPlots, LaTeXStrings   # Distribution plot recipe and L"..." labels
prior_dist = Beta(5, 5)
plot(prior_dist; label=false, xlabel=L"$θ$", ylabel=L"$p(θ)$", linewidth=3, tickfontsize=16, guidefontsize=18)
plot!(size=(500, 500))
Figure 1: Beta prior for coin flipping example

Maximum A Posteriori Estimate

Combining the likelihood and prior using Bayes’ rule lets us calculate the maximum a posteriori (MAP) estimate:

flip_ll(θ) = sum(logpdf.(Bernoulli(θ), flips .== "H"))
flip_lprior(θ) = logpdf(Beta(5, 5), θ)
flip_lposterior(θ) = flip_ll(θ) + flip_lprior(θ)
θ_map = Optim.optimize(θ -> -flip_lposterior(θ), 0, 1).minimizer
round(θ_map, digits=2)
0.71
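As a check: the Beta prior is conjugate to the Bernoulli likelihood, so the posterior is available in closed form as \(\text{Beta}(5 + 8, 5 + 1) = \text{Beta}(13, 6)\), whose mode agrees with the numerical MAP estimate:

mode(Beta(13, 6))   # (13 - 1) / (13 + 6 - 2) = 12/17 ≈ 0.71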

Coin Flipping Posterior Distribution

Code
θ_range = 0:0.01:1
plot(θ_range, flip_lposterior.(θ_range), color=:black, label="Posterior", linewidth=3)
plot!(θ_range, flip_ll.(θ_range), color=:black, label="Likelihood", linewidth=3, linestyle=:dash)
plot!(θ_range, flip_lprior.(θ_range), color=:black, label="Prior", linewidth=3, linestyle=:dot)
vline!([θ_map], color=:red, label="MAP", linewidth=2)
vline!([θ_mle], color=:blue, label="MLE", linewidth=2)
xlabel!(L"$\theta$")
ylabel!("Log-Density")
plot!(size=(1000, 450))
Figure 2: Posterior distribution for the coin-flipping example

Bayes and Parametric Uncertainty

Frequentist: Parametric uncertainty is purely the result of sampling variability

Bayesian: Parameters have probabilities based on consistency with data and priors.

Think: how “likely” is a set of parameters to have produced the data given the specified data generating process?

Bayesian Updating

  • The posterior is a “compromise” between the prior and the data.
  • The posterior mean is a weighted combination of the data and the prior mean.
  • The weights depend on the prior and the likelihood variances.
  • More data usually makes the posterior more confident.
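For the conjugate Beta-Bernoulli model from the coin example, this compromise is explicit: with a \(\text{Beta}(\alpha, \beta)\) prior and \(k\) heads in \(n\) flips,

\[\mathbb{E}[\theta | y] = \frac{\alpha + k}{\alpha + \beta + n} = \underbrace{\frac{\alpha + \beta}{\alpha + \beta + n}}_{\text{prior weight}} \frac{\alpha}{\alpha + \beta} + \underbrace{\frac{n}{\alpha + \beta + n}}_{\text{data weight}} \frac{k}{n},\]

so the weight on the data grows toward one as \(n\) increases.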

Key Points

Key Points

  • Bayesian probability: parameters have probabilities conditional on data.
  • Need to specify a prior distribution (think generatively!).
  • The posterior distribution reflects a compromise between the prior and the likelihood.
  • The maximum a posteriori estimate gives the “most probable” parameter values.

Key Points: Priors

  • Use prior predictive simulations to refine priors (see the sketch after this list).
  • Be transparent and principled about prior choices (sensitivity analyses?).
  • Don’t choose priors based on convenience.
  • Will talk more about general sampling later.
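A minimal prior predictive sketch for the coin example: draw \(\theta\) from the \(\text{Beta}(5, 5)\) prior, simulate 9 flips, and check whether the simulated head counts look plausible before involving the data.

using Distributions

prior = Beta(5, 5)
# head counts for 9 flips under parameters drawn from the prior
n_heads = [rand(Binomial(9, rand(prior))) for _ in 1:10_000]
# if implausible counts dominate, revise the prior before fitting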

Upcoming Schedule

Next Classes

Next Week: Sampling! Specifically, Monte Carlo.

Assessments

Homework 2 due Friday (2/21).

Quiz: Due Monday (all on today’s lecture).

Project: Will discuss Monday, start thinking about possible topics.

References

References

Bayes, T. (1763). An Essay towards Solving a Problem in the Doctrine of Chances. Philosophical Transactions of the Royal Society of London, 53, 370–418.
Gelman, A., & Shalizi, C. R. (2013). Philosophy and the practice of Bayesian statistics. British Journal of Mathematical and Statistical Psychology, 66, 8–38. https://doi.org/10.1111/j.2044-8317.2011.02037.x
Gelman, A., Simpson, D., & Betancourt, M. (2017). The prior can often only be understood in the context of the likelihood. Entropy, 19, 555. https://doi.org/10.3390/e19100555
Jaynes, E. T. (2003). Probability Theory: The Logic of Science (G. L. Bretthorst, Ed.). Cambridge, UK; New York, NY: Cambridge University Press.
Laplace, P. S. (1774). Mémoire sur la Probabilité des Causes par les évènemens. In Mémoires de Mathematique et de Physique, Presentés à l’Académie Royale des Sciences, Par Divers Savans & Lus Dans ses Assemblées (pp. 621–656).