Information Criteria
“Information criteria” refers to a category of estimators of prediction error.
The idea: estimate predictive error using the fitted model.
Information Criteria Overview
There is a common framework for all of these:
\[\widehat{\text{elpd}} = \underbrace{\log p(y | \theta, \mathcal{M})}_{\text{in-sample log-predictive density}} - \underbrace{d(\mathcal{M})}_{\text{penalty for degrees of freedom}}\]
Here \(\theta\) can be a point estimate (e.g. the MLE) or integrated over the sampling or posterior distribution, depending on the criterion.
Akaike Information Criterion (AIC)
The “first” information criterion that most people see.
Uses a point estimate (the maximum-likelihood estimate \(\hat{\theta}_\text{MLE}\)) to compute the log-predictive density for the data, corrected by the number of parameters \(k\): \[\widehat{\text{elpd}}_\text{AIC} = \log p(y | \hat{\theta}_\text{MLE}) - k.\]
The AIC is defined as \(-2\widehat{\text{elpd}}_\text{AIC}\).
Due to this convention, lower AIC values are better (they correspond to higher estimated predictive skill).
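For example, here is a minimal sketch of the AIC computation (assuming Distributions.jl; the data and the normal model are purely illustrative):

```julia
using Distributions

# illustrative data; in practice y is the observed dataset
y = 5.0 .+ 2.0 .* randn(100)

# fit a normal sampling distribution by maximum likelihood (k = 2: mean and std)
mle_fit = fit_mle(Normal, y)
k = 2

# in-sample log-predictive density at the MLE, penalized by k
elpd_aic = loglikelihood(mle_fit, y) - k
aic = -2 * elpd_aic
```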
AIC Correction Term
In the case of a model with a normal sampling distribution, uniform priors, and sample size \(N \gg k\), \(k\) is the asymptotically "correct" bias correction (there are modified corrections for small sample sizes; see below).
However, with more informative priors and/or hierarchical models, the bias correction \(k\) is no longer quite right, as there is less “freedom” associated with each parameter.
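One common small-sample modification mentioned above is the corrected AIC: \[\text{AIC}_c = \text{AIC} + \frac{2k(k+1)}{N - k - 1},\] which converges to AIC as \(N \to \infty\).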
Figure 2: (a) Simulation of in-sample vs. out-of-sample AIC calculations with increasing number of simulations.
AIC: Storm Surge Example
Question: Do climate oscillations (e.g. the Pacific Decadal Oscillation) result in variations in tidal extremes at the San Francisco tide gauge station?
Code
```julia
using CSV
using DataFrames
using DataFramesMeta
using Dates
using Statistics
using Plots

# load SF tide gauge data: read in data and get annual maxima
function load_data(fname)
    date_format = DateFormat("yyyy-mm-dd HH:MM:SS")
    # This uses the DataFramesMeta.jl package, which makes it easy to
    # string together commands to load and process data
    df = @chain fname begin
        CSV.read(DataFrame; header=false)
        rename("Column1" => "year", "Column2" => "month", "Column3" => "day",
               "Column4" => "hour", "Column5" => "gauge")
        # need to reformat the decimal date in the data file
        @transform :datetime = DateTime.(:year, :month, :day, :hour)
        # replace -99999 with missing
        @transform :gauge = ifelse.(abs.(:gauge) .>= 9999, missing, :gauge)
        select(:datetime, :gauge)
    end
    return df
end
dat = load_data("data/surge/h551.csv")

# detrend the data to remove the effects of sea-level rise and seasonal dynamics
ma_length = 366
ma_offset = Int(floor(ma_length / 2))
moving_average(series, n) = [mean(@view series[i-n:i+n]) for i in n+1:length(series)-n]
dat_ma = DataFrame(
    datetime=dat.datetime[ma_offset+1:end-ma_offset],
    residual=dat.gauge[ma_offset+1:end-ma_offset] .- moving_average(dat.gauge, ma_offset)
)

# group data by year and compute the annual maxima
dat_ma = dropmissing(dat_ma) # drop missing data
dat_annmax = combine(dat_ma -> dat_ma[argmax(dat_ma.residual), :],
    groupby(DataFrames.transform(dat_ma, :datetime => x -> year.(x)), :datetime_function))
delete!(dat_annmax, nrow(dat_annmax)) # delete 2023; haven't seen much of that year yet
rename!(dat_annmax, :datetime_function => :Year)
select!(dat_annmax, [:Year, :residual])
# gauge residuals are already in mm, so no unit conversion is needed

# make plots
psurge = plot(
    dat_annmax.Year, dat_annmax.residual;
    xlabel="Year", ylabel="Annual Max Tide (mm)",
    label=false, marker=:circle, markersize=5
)
plot!(psurge, size=(600, 400))
```
Absolute AIC values have no meaning; only the differences \(\Delta_i = \text{AIC}_i - \text{AIC}_\text{min}\) matter.
Some basic rules of thumb (from Burnham & Anderson (2004)):
\(\Delta_i \leq 2\) means the model has "substantial" support across \(\mathcal{M}\);
\(4 \leq \Delta_i \leq 7\) suggests "considerably less" support;
\(\Delta_i > 10\) suggests "essentially no" support.
Model Averaging vs. Selection
Model averaging can sometimes be beneficial relative to model selection.
Model selection can introduce bias from the selection process (this is particularly acute for stepwise selection due to path-dependence).
AIC and Model Evidence
\(\exp(-\Delta_i/2)\) can be thought of as a measure of the relative likelihood of model \(\mathcal{M}_i\) given the data \(y\).
The ratio \[\exp(-\Delta_i/2) / \exp(-\Delta_j/2) = \exp\left(-(\Delta_i - \Delta_j)/2\right)\] can approximate the relative evidence for \(\mathcal{M}_i\) versus \(\mathcal{M}_j\).
AIC and Model Averaging
This gives rise to the idea of Akaike weights: \[w_i = \frac{\exp(-\Delta_i/2)}{\sum_{m=1}^M \exp(-\Delta_m/2)}.\]
Model projections can then be weighted based on \(w_i\), which can be interpreted as the probability that \(\mathcal{M}_i\) is the best model in \(\mathcal{M}\) (in the sense of best approximating the "true" predictive distribution).
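As a minimal sketch (the AIC values here are purely illustrative):

```julia
# purely illustrative AIC values for a set of candidate models
aics = [1002.3, 1004.1, 1010.7]

# AIC differences relative to the best (lowest-AIC) model
deltas = aics .- minimum(aics)

# Akaike weights: normalized relative model likelihoods
rel_lik = exp.(-deltas ./ 2)
weights = rel_lik ./ sum(rel_lik)

# a model-averaged projection is then sum(weights .* projections),
# where projections holds each model's prediction
```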
Bayesian Information Criterion (BIC)
The Bayesian Information Criterion (BIC) uses the MLE with a sample-size-dependent penalty: \[\text{BIC} = -2 \log p(y | \hat{\theta}_\text{MLE}, \mathcal{M}) + k \log N.\]
It is an approximation of the log-marginal likelihood \[\log p(y | \mathcal{M}) = \log \int_\theta p(y | \theta, \mathcal{M}) p(\theta | \mathcal{M}) \, d\theta\] under a whole host of assumptions related to large-sample approximation (so priors don't matter).
Note that \(p(y | \mathcal{M})\) is the prior predictive density of the data, which is why BIC penalizes model complexity more heavily than AIC.
BIC For Model Comparison
When comparing two models \(\mathcal{M}_1\) and \(\mathcal{M}_2\), we get an approximation of the log-Bayes Factor (BF): \[\log \text{BF}_{12} = \log \frac{p(y | \mathcal{M}_1)}{p(y | \mathcal{M}_2)} \approx -\frac{1}{2}\left(\text{BIC}_1 - \text{BIC}_2\right).\]
This means that \(\Delta \text{BIC}\) gives an approximation of the posterior model probabilities across a model set assuming equal prior probabilities.
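As a sketch (assuming Distributions.jl; the data and the two candidate sampling distributions are purely illustrative):

```julia
using Distributions

# illustrative data
y = 5.0 .+ 2.0 .* randn(100)
n = length(y)

# two candidate models, each with k = 2 fitted parameters
models = [fit_mle(Normal, y), fit_mle(Laplace, y)]
k = 2

# BIC = -2 log p(y | θ̂_MLE) + k log(n)
bics = [-2 * loglikelihood(m, y) + k * log(n) for m in models]

# approximate log-Bayes Factor for model 1 vs. model 2
log_bf_12 = -0.5 * (bics[1] - bics[2])

# approximate posterior model probabilities under equal prior probabilities
delta_bic = bics .- minimum(bics)
post_probs = exp.(-delta_bic ./ 2) ./ sum(exp.(-delta_bic ./ 2))
```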
BIC vs. AIC
BIC tends to select more parsimonious models due to its stronger complexity penalty (\(k \log N\) vs. AIC's \(2k\));
AIC will tend to overfit, BIC to underfit.
BIC is consistent (as \(N \to \infty\), it selects the "true" model with probability 1 if it is in \(\mathcal{M}\)) but not efficient (it does not minimize out-of-sample predictive error); AIC is the reverse.
BIC vs. AIC is analogous to the tradeoff between causal and predictive analyses; it is generally not coherent to use both for the same problem.
Other Information Criteria
These follow the same pattern: compute \(\widehat{\text{elpd}}\) based on some estimate and penalize for model degrees of freedom.
Deviance Information Criterion (DIC)
The Deviance Information Criterion (DIC) is a more Bayesian generalization of AIC which uses the posterior mean \[\hat{\theta}_\text{Bayes} = \mathbb{E}\left[\theta | y\right]\] and a bias correction derived from the data:
\[\widehat{\text{elpd}}_\text{DIC} = \log p(y | \hat{\theta}_\text{Bayes}) - p_\text{DIC}, \qquad p_\text{DIC} = 2\left(\log p(y | \hat{\theta}_\text{Bayes}) - \mathbb{E}_\text{post}\left[\log p(y | \theta)\right]\right).\]
Widely Applicable Information Criterion (WAIC)
The Widely Applicable (or Watanabe-Akaike) Information Criterion (WAIC) instead averages the pointwise predictive density over the full posterior:
\[\widehat{\text{elpd}}_\text{WAIC} = \sum_{i=1}^n \log \int p(y_i | \theta) p(\theta | y) \, d\theta - d_\text{WAIC},\]
where \[d_\text{WAIC} = \sum_{i=1}^n \text{Var}_\text{post}\left(\log p(y_i | \theta)\right).\]
WAIC Correction Factor
\(d_\text{WAIC}\) is an estimate of the effective number of "unconstrained" parameters in the model (see the sketch after the list below).
A parameter counts as 1 if its estimate is effectively "independent" of the prior (fully informed by the data);
A parameter counts as 0 if it is fully constrained by the prior;
A parameter contributes a fractional value if both the data and the prior are informative.
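As a minimal sketch of estimating \(d_\text{WAIC}\) and \(\widehat{\text{elpd}}_\text{WAIC}\) from posterior samples (using an illustrative conjugate normal-mean model with known unit variance and a flat prior, so the posterior has a closed form):

```julia
using Distributions, Statistics

# illustrative data and posterior: y_i ~ Normal(μ, 1) with a flat prior on μ,
# so the posterior is μ | y ~ Normal(mean(y), 1/n)
y = 1.0 .+ randn(50)
n = length(y)
post_draws = mean(y) .+ randn(1_000) ./ sqrt(n)

# pointwise log-likelihood matrix: rows = posterior draws, columns = observations
log_lik = [logpdf(Normal(mu, 1), yi) for mu in post_draws, yi in y]

# log pointwise predictive density: log of the posterior-averaged density
lppd = sum(log.(mean(exp.(log_lik); dims=1)))

# d_WAIC: sum of posterior variances of the pointwise log-likelihood
d_waic = sum(var(log_lik; dims=1))

elpd_waic = lppd - d_waic
waic = -2 * elpd_waic
```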
WAIC vs. AIC and DIC
WAIC can be viewed as an approximation to leave-one-out CV; it averages over the entire posterior, whereas AIC and DIC rely on point estimates.
But it doesn't work well with highly structured (e.g. temporally or spatially correlated) data; in those settings there is no real alternative to more clever uses of Bayesian cross-validation.
Key Takeaways and Upcoming Schedule
Key Takeaways
LOO-CV is ideal for navigating the bias-variance tradeoff but can be computationally prohibitive.
Information Criteria are an approximation to LOO-CV based on “correcting” for model complexity.
They approximate out-of-sample predictive error by applying a penalty for the potential to overfit.
Some ICs approximate the K-L divergence/LOO-CV, others the marginal likelihood; these have different implications for predictive vs. causal modeling.
Next Classes
Wednesday: Modeling Extreme Values
References
Burnham, K. P., & Anderson, D. R. (2004). Multimodel Inference: Understanding AIC and BIC in Model Selection. Sociol. Methods Res., 33, 261–304. https://doi.org/10.1177/0049124104268644
Stone, M. (1977). An asymptotic equivalence of choice of model by cross-validation and Akaike’s criterion. J. R. Stat. Soc. Series B Stat. Methodol., 39, 44–47. https://doi.org/10.1111/j.2517-6161.1977.tb01603.x