Lecture 17
March 24, 2024
Commonly discussed in terms of a “bias-variance tradeoff”: it is more valuable to think of bias and variance as contributors to total error.
The potential to overfit vs. underfit isn’t directly related to standard metrics of “model complexity” (number of parameters, etc.).
Instead, think of degrees of freedom: how much flexibility is there given the model parameterization to “chase” reduced model error?
Can reduce degrees of freedom with regularization: tighter/more skeptical priors, shrinkage of estimates (e.g. LASSO) vs. “raw” MLE.

Source: Richard McElreath
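As a minimal sketch of this idea (not from the lecture), compare a “raw” least-squares fit with a ridge-penalized fit; the penalty weight λ and the simulated data are illustrative assumptions:

using LinearAlgebra, Random, Distributions
Random.seed!(1)
n, p = 20, 10
X = rand(Normal(0, 1), n, p)
β_true = vcat([2.0, -1.0], zeros(p - 2))   # only two nonzero coefficients
y = X * β_true .+ rand(Normal(0, 1), n)
# "raw" MLE / ordinary least squares
β_ols = X \ y
# ridge estimate: penalizing ‖β‖² shrinks coefficients toward zero,
# analogous to a tighter zero-mean normal prior (smaller prior sd as λ grows)
λ = 5.0
β_ridge = (X'X + λ * I) \ (X'y)
println("‖β_ols‖ = ", norm(β_ols), "   ‖β_ridge‖ = ", norm(β_ridge))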
What do we want to see in a probabilistic projection \(F\)?
Common to use the probability integral transform (PIT) to make these more concrete: \(Z_F = F(y)\).
The forecast is probabilistically calibrated if \(Z_F \sim Uniform(0, 1)\).
The forecast is properly dispersed if \(\text{Var}(Z_F) = 1/12\).
Sharpness can be measured by the width of a particular prediction interval. A good forecast is as sharp as possible subject to calibration (Gneiting et al., 2007).
# "true" observation distribution is N(2, 0.5)
obs = rand(Normal(2, 0.5), 50)
# forecast according to the "correct" distribution and obtain PIT
pit_corr = cdf.(Normal(2, 0.5), obs)
p_corr = histogram(pit_corr, bins=10, label=false, xlabel=L"$Z_F$", ylabel="Count", size=(500, 500))
xrange = 0:0.01:5
p_cdf1 = plot(xrange, cdf.(Normal(2, 0.5), xrange), xlabel=L"$y$", ylabel="Cumulative Density", label="Forecast", size=(500, 500))
plot!(p_cdf1, xrange, cdf.(Normal(2, 0.5), xrange), label="Truth")
display(p_cdf1)
display(p_corr)

# forecast according to an underdispersed distribution and obtain PIT
pit_under = cdf.(Normal(2, 0.1), obs)
p_under = histogram(pit_under, bins=10, label=false, xlabel=L"$Z_F$", ylabel="Count", size=(500, 500))
xrange = 0:0.01:5
p_cdf2 = plot(xrange, cdf.(Normal(2, 0.1), xrange), xlabel=L"$y$", ylabel="Cumulative Density", label="Forecast", size=(500, 500))
plot!(p_cdf2, xrange, cdf.(Normal(2, 0.5), xrange), label="Truth")
display(p_cdf2)
display(p_under)

# forecast according to an overdispersed distribution and obtain PIT
pit_over = cdf.(Normal(2, 1), obs)
p_over = histogram(pit_over, bins=10, label=false, xlabel=L"$Z_F$", ylabel="Count", size=(500, 500))
xrange = 0:0.01:5
p_cdf3 = plot(xrange, cdf.(Normal(2, 1), xrange), xlabel=L"$y$", ylabel="Cumulative Density", label="Forecast", size=(500, 500))
plot!(p_cdf3, xrange, cdf.(Normal(2, 0.5), xrange), label="Truth")
display(p_cdf3)
display(p_over)

Scoring rules compare observations against an entire probabilistic forecast.
A scoring rule \(S(F, y)\) measures the “loss” of a predicted probability distribution \(F\) once an observation \(y\) is obtained.
Typically oriented so smaller = better.
Proper scoring rules are intended to encourage forecasters to provide their full (and honest) forecasts.
Minimized in expectation when the forecast distribution \(F\) matches the data-generating distribution \(G\):
\[\mathbb{E}_{Y \sim G}\left[S(G, Y)\right] \leq \mathbb{E}_{Y \sim G}\left[S(F, Y)\right] \qquad \forall F.\]
It is strictly proper if equality holds only if \(F = G\).
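For a quick sanity check of this definition (an added example, not from the slides): for a binary outcome with true probability \(q\), the quadratic (Brier) score \(S(p, y) = (y - p)^2\) has expected value
\[\mathbb{E}_{Y \sim \text{Bernoulli}(q)}\left[(Y - p)^2\right] = q(1 - p)^2 + (1 - q)p^2,\]
whose derivative in \(p\) is \(2(p - q)\), so the expected score is uniquely minimized at \(p = q\): the Brier score is strictly proper.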
Most classification algorithms (e.g. logistic regression) produce probabilities of the different outcomes.
A common skill metric for classification models is accuracy (or sensitivity/specificity), which requires a threshold to translate these probabilities into a categorical prediction.
The problem: this translation is a decision problem, not a statistical problem. A probabilistic scoring rule applied to the predicted probabilities more accurately reflects the skill of the statistical model.
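For instance (an illustrative sketch, not from the slides), two forecasts that are equally “accurate” under a 0.5 threshold can have very different log scores:

# observed outcome: the event occurred (y = 1)
p̂ = [0.51, 0.99]            # two forecasts of P(y = 1)
correct = p̂ .> 0.5           # both count as correct classifications
log_scores = -log.(p̂)        # ≈ 0.67 vs ≈ 0.01: the log score distinguishes them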
The logarithmic score \(S(F, y) = -\log f(y)\), where \(f\) is the predictive density of \(F\), is (up to equivalence) the only local strictly proper scoring rule (locality: the score depends on \(F\) only through its density at the observed value \(y\)).
This is the negative log-probability: straightforward to compute from the likelihood (frequentist forecasts) or the posterior predictive (Bayesian forecasts), and it generalizes MSE (for a normal predictive distribution with fixed variance, the log score is squared error up to an additive constant and scaling).
We will focus on the logarithmic score.
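As a sketch (reusing obs and the three forecast distributions from the PIT example above), the mean log score should rank the correctly specified forecast ahead of the under- and overdispersed ones:

using Distributions, Statistics
# mean negative log predictive density over the observations (smaller = better)
logscore(d, y) = -mean(logpdf.(d, y))
println("correct:        ", logscore(Normal(2, 0.5), obs))
println("underdispersed: ", logscore(Normal(2, 0.1), obs))
println("overdispersed:  ", logscore(Normal(2, 1), obs))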
A model can predict well without being “correct”!
For example, model selection using predictive criteria does not mean you are selecting the “true” model.
The causes of the data cannot be found in the data alone.
Effectively, no. Why?
The goal is then to minimize the generalized (expected) error:
\[\mathbb{E}\left[L(X, \theta)\right] = \int_X L(x, \theta) \pi(x)dx\]
where \(L(x, \theta)\) is an error function capturing the discrepancy between \(\hat{f}(x, \theta)\) and \(y\).
Since we don’t know the “true” distribution of \(y\), we could try to approximate it using the training data:
\[\hat{L} = \min_{\theta \in \Theta} L(x_n, \theta)\]
But: This is minimizing in-sample error and is likely to result in an optimistic score.
Instead, let’s divide our data into a training dataset \(y_k\) and testing dataset \(\tilde{y}_l\).
This gives an (approximately) unbiased estimate of the generalization error, but a noisy one.
What if we repeated this procedure for multiple held-out sets?
If data are large, this is a good approximation.
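This repeated hold-out procedure is \(k\)-fold cross-validation. A minimal sketch (the Gaussian-mean model, the simulated data, and \(k = 5\) are illustrative assumptions):

using Distributions, Random, Statistics
Random.seed!(2)
y = rand(Normal(2, 0.5), 50)
k = 5
folds = [i:k:length(y) for i in 1:k]          # assign every k-th index to a fold
# for each fold: fit on the remaining data, then score the held-out points
fold_scores = map(folds) do test_idx
    train = y[setdiff(eachindex(y), test_idx)]
    d̂ = Normal(mean(train), std(train))       # "fit": estimate mean and sd
    -mean(logpdf.(d̂, y[test_idx]))            # held-out log score (smaller = better)
end
cv_score = mean(fold_scores)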
The problem with \(k\)-fold CV when data are scarce: each fit withholds \(n/k\) points.
LOO-CV: Set \(k=n\)
The trouble: estimates of \(L\) are highly correlated since every two datasets share \(n-2\) points.
The benefit: LOO-CV approximates seeing “the next datum”.
Model: \[D \rightarrow S \ {\color{purple}\leftarrow U}\] \[S = f(D, U)\]
Out of sample: \(p(y_i | y_{-i})\) = 5.2
In sample: \(p(\hat{y}_{-i} | y_{-i})\) = 5.7
LOO-CV score: 5.8
This is the average log-likelihood of out-of-sample data.
Bayesian LOO-CV involves using the posterior predictive distribution
\[\begin{align*} \text{lppd}_\text{cv} &= \sum_{i=1}^N \log p_{\text{post}(-i)}(y_i) \\ &\approx \sum_{i=1}^N \log \left(\frac{1}{S} \sum_{s=1}^S p(y_i | \theta_{-i, s})\right), \end{align*}\]
which requires refitting the model without \(y_i\) for every data point.
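A sketch of this brute-force computation for a conjugate normal-mean model (reusing obs from the PIT example; treating the observation sd as known and using a Normal(0, 10) prior on the mean are illustrative assumptions, and the closed-form posterior stands in for an actual refit):

using Distributions
σ = 0.5                         # treat the observation sd as known
μ0, τ0 = 0.0, 10.0              # prior: mean ~ Normal(μ0, τ0)
# leave-one-out posterior predictive for obs[i] under the conjugate normal-mean model
function loo_predictive(y, i)
    y_rest = y[setdiff(eachindex(y), i)]
    n = length(y_rest)
    τn² = 1 / (1 / τ0^2 + n / σ^2)                 # posterior variance of the mean
    μn = τn² * (μ0 / τ0^2 + sum(y_rest) / σ^2)     # posterior mean
    return Normal(μn, sqrt(τn² + σ^2))             # posterior predictive for a new point
end
lppd_cv = sum(logpdf(loo_predictive(obs, i), obs[i]) for i in eachindex(obs))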
Drop \(k\) values, refit model on rest of data, check for predictive skill.
As \(k \to n\), this reduces to the prior predictive distribution \[p(y^{\text{rep}}) = \int_{\theta} p(y^{\text{rep}} | \theta) p(\theta) d\theta.\]
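A tiny sketch of sampling from that prior predictive by drawing \(\theta\) from the prior and then \(y^{\text{rep}}\) from the likelihood (the normal prior and likelihood are illustrative assumptions):

using Distributions
# prior: θ ~ Normal(0, 10); likelihood: y | θ ~ Normal(θ, 0.5)
θ_draws = rand(Normal(0, 10), 1000)
y_rep = [rand(Normal(θ, 0.5)) for θ in θ_draws]   # one replicated datum per prior draw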
Cross-validation can be used to evaluate overfitting directly, rather than by switching to a different (simpler) model structure.
What happens to CV error with tighter priors/regularization penalty?
But remember, prediction is not the same as scientific inference: try to balance both considerations.
Wednesday: Entropy and Information Criteria
HW4: Due on 4/11 at 9pm.