Lecture 14
March 12, 2024
Let \(t_0\) be the “true” value of a statistic, \(\hat{t}\) the estimate of the statistic from the sample, and \((\tilde{t}_i)\) the bootstrap estimates.
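As a minimal sketch of this notation in code (the data and statistic here are hypothetical, not from the lecture):

```julia
using Random, Statistics

Random.seed!(1)
x = randn(100)              # a hypothetical sample
t_hat = median(x)           # t̂: estimate of the statistic from the sample
# t̃ᵢ: estimates of the statistic from resampling the data with replacement
t_tilde = [median(rand(x, length(x))) for _ in 1:1000]
```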
Depends on trust in model “correctness”:

- Do we trust the model specification to be reasonably correct?
- Do we trust that we have enough samples to recover the empirical CDF?
- Do we trust the data-generating process?
Rejection sampling (or importance sampling) lets us draw i.i.d. samples from a proposal density \(g(\cdot)\) and reject (or re-weight) them based on the target \(f(\cdot)\).
Both require \(f(x) \leq M g(x)\) for some \(1 < M < \infty\).
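For concreteness, here is a minimal rejection-sampling sketch in Julia; the target \(f\) (a Beta(2, 5) density), the uniform proposal \(g\), and the bound \(M = 2.5\) are illustrative choices, not from the lecture:

```julia
using Distributions

f(x) = pdf(Beta(2, 5), x)   # target density (illustrative choice)
g(x) = 1.0                  # proposal density: Uniform(0, 1)
M = 2.5                     # satisfies f(x) ≤ M g(x) on [0, 1]

# draw n i.i.d. samples from f by sampling from g and accepting
# each proposal y with probability f(y) / (M g(y))
function rejection_sample(n)
    samples = Float64[]
    while length(samples) < n
        y = rand()                      # propose y ~ g
        if rand() ≤ f(y) / (M * g(y))
            push!(samples, y)           # accept
        end
    end
    return samples
end

samples = rejection_sample(10_000)      # acceptance rate ≈ 1/M
```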
A wishlist:
Suppose we want to sample from a probability distribution \(f(\cdot)\) and are at a parameter vector \(x\).
What if we had a method that would let us stochastically jump from \(x\) to a new vector \(y\) in such a way that, eventually, we would visit any given vector with probability \(f\)?
This would let us trade convenience/flexibility for dependent samples.
There is a mathematical process that has these properties: Markov Chains.
These methods are called Markov chain Monte Carlo (MCMC).
Consider a stochastic process \(\{X_t\}_{t \in \mathcal{T}}\), where each \(X_t\) takes values in a set of states \(\mathcal{S} = \{s_1, \ldots, s_n\}\) and \(\mathcal{T}\) is a set of (for us, discrete) times.

This stochastic process is a Markov chain if it satisfies the Markovian (or memoryless) property: \[\mathbb{P}(X_{T+1} = s_i \mid X_1=x_1, \ldots, X_T=x_T) = \mathbb{P}(X_{T+1} = s_i \mid X_T=x_T)\]

Suppose the weather can be foggy, sunny, or rainy.
Based on past experience, we know that:

- if today is foggy, tomorrow will be foggy with probability 1/2, sunny with probability 1/4, and rainy with probability 1/4;
- if today is sunny, tomorrow will be foggy with probability 1/2 or rainy with probability 1/2;
- if today is rainy, tomorrow will be foggy with probability 1/4, sunny with probability 1/4, and rainy with probability 1/2.

Crucially, suppose that tomorrow’s weather depends only on today’s weather, not on the weather of prior days.
We can summarize these probabilities in a transition matrix \(P\): \[ P = \begin{array}{cc} \begin{array}{ccc} \phantom{i}\color{red}{F}\phantom{i} & \phantom{i}\color{red}{S}\phantom{i} & \phantom{i}\color{red}{R}\phantom{i} \end{array} \\ \begin{pmatrix} 1/2 & 1/4 & 1/4 \\ 1/2 & 0 & 1/2 \\ 1/4 & 1/4 & 1/2 \end{pmatrix} & \begin{array}{ccc} \color{red}F \\ \color{red}S \\ \color{red}R \end{array} \end{array} \]
Rows are the current state and columns are the next state, so each row sums to one: \(\sum_j p_{ij} = 1\).
Denote by \(\lambda^t\) a probability distribution over the states at time \(t\).
Then \(\lambda^t = \lambda^{t-1}P\):
\[\begin{pmatrix}\lambda^t_F & \lambda^t_S & \lambda^t_R \end{pmatrix} = \begin{pmatrix}\lambda^{t-1}_F & \lambda^{t-1}_S & \lambda^{t-1}_R \end{pmatrix} \begin{pmatrix} 1/2 & 1/4 & 1/4 \\ 1/2 & 0 & 1/2 \\ 1/4 & 1/4 & 1/2 \end{pmatrix} \]
Notice that \[\lambda^{t+i} = \lambda^t P^i,\] so multi-step transition probabilities are given by powers of \(P\).
\[P^3 = \begin{array}{cc} \begin{array}{ccc} \phantom{iii}\color{red}{F}\phantom{ii} & \phantom{iii}\color{red}{S}\phantom{iii} & \phantom{ii}\color{red}{R}\phantom{iii} \end{array} \\ \begin{pmatrix} 26/64 & 13/64 & 25/64 \\ 26/64 & 12/64 & 26/64 \\ 25/64 & 13/64 & 26/64 \end{pmatrix} & \begin{array}{ccc} \color{red}F \\ \color{red}S \\ \color{red}R \end{array} \end{array} \]
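As a quick check (not part of the original slides), we can reproduce these powers exactly with Julia’s rational arithmetic:

```julia
# transition matrix with exact rational entries
P = [1//2 1//4 1//4;
     1//2 0    1//2;
     1//4 1//4 1//2]

all(sum(P, dims=2) .== 1)   # true: each row sums to 1
P^3                          # matches the matrix above
                             # (Julia reduces fractions, e.g. 26//64 → 13//32)
```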
What happens if we let the system run for a while starting from an initial sunny day?
```julia
using Plots

# transition matrix: rows = current state, columns = next state (F, S, R)
P = [1/2 1/4 1/4;
     1/2 0   1/2;
     1/4 1/4 1/2]

# initial condition: today is sunny, so [F, S, R] = [0, 1, 0]
current = [0.0, 1.0, 0.0]

# propagate the state distribution forward: λᵗ = λᵗ⁻¹ P
T = 21
state_probs = zeros(T, 3)
state_probs[1, :] = current
for t in 1:T-1
    state_probs[t+1, :] = state_probs[t:t, :] * P
end

p = plot(0:T-1, state_probs, label=["Foggy" "Sunny" "Rainy"],
         palette=:mk_8, linewidth=3)
xlabel!("Time")
ylabel!("State Probability")
plot!(p, size=(1000, 350))
```

This stabilization always occurs when the probability distribution is an eigenvector of \(P\) with eigenvalue 1:
\[\pi = \pi P.\]
This is called an invariant or a stationary distribution.
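To illustrate, we can recover π numerically as the left eigenvector of \(P\) associated with eigenvalue 1 (a sketch, not from the slides):

```julia
using LinearAlgebra

P = [1/2 1/4 1/4;
     1/2 0   1/2;
     1/4 1/4 1/2]

# π = πP means π is a right eigenvector of P' with eigenvalue 1
vals, vecs = eigen(Matrix(P'))
v = real.(vecs[:, argmin(abs.(vals .- 1))])  # eigenvector for eigenvalue ≈ 1
π_stat = v / sum(v)                          # normalize to a probability vector
# π_stat ≈ [0.4, 0.2, 0.4]
```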
This property is called ergodicity (the chain is said to be ergodic). Ergodic Markov chains always have a limiting distribution, which is the limit of the time-evolution of the chain dynamics: \[\pi_j = \lim_{t \to \infty} \mathbb{P}(X_t = s_j).\]
Key: The limiting distribution is independent of the initial state probability.
For an ergodic chain, the limiting distribution is the unique stationary distribution (we won’t prove uniqueness):
\[ \begin{align} \pi_j &= \lim_{t \to \infty} \mathbb{P}(X_t = s_j | X_0 = s_i) \\ &= \lim_{t \to \infty} (P^{t+1})_{ij} = \lim_{t \to \infty} (P^tP)_{ij} \\ &= \lim_{t \to \infty} \sum_d (P^t)_{id} P_{dj} \\ &= \sum_d \pi_d P_{dj} \end{align} \]
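A quick numerical illustration of this independence: every row of \(P^t\) converges to the same vector, so the starting state \(s_i\) washes out.

```julia
P = [1/2 1/4 1/4;
     1/2 0   1/2;
     1/4 1/4 1/2]

P^20   # every row ≈ [0.4, 0.2, 0.4], the stationary distribution π
```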
Proving that a chain is ergodic gets into the mathematical weeds a bit (and is outside the scope of this class).
The good news: The goal of any MCMC algorithm is to construct an ergodic chain where the stationary distribution \(\pi(\cdot)\) is the target \(f(\cdot)\).
This means that if you’re using a “standard” algorithm, the existence of a stationary distribution for the produced Markov chain is mathematically guaranteed.
The portion of the chain prior to convergence to the stationary distribution is called the transient portion.
Monday: MCMC
Wednesday: Cross-Validation and Model Skill