What do we want to see in a probabilistic projection \(F\)?
Calibration: Does the predicted CDF \(F(y)\) align with the “true” distribution of observations \(y\)? \[\mathbb{P}(y \leq F^{-1}(\tau)) = \tau \qquad \forall \tau \in [0, 1]\]
Dispersion: Is the concentration (variance) of \(F\) aligned with the concentration of observations?
Sharpness: How concentrated are the forecasts \(F\)?
Probability Integral Transform (PIT)
Common to use the PIT to make these more concrete: \(Z_F = F(y)\).
The forecast is probabilistically calibrated if \(Z_F \sim \text{Uniform}(0, 1)\).
The forecast is properly dispersed if \(\text{Var}(Z_F) = 1/12\).
Sharpness can be measured by the width of a particular prediction interval. A good forecast is as sharp as possible subject to calibration (Gneiting et al., 2007); a minimal PIT-based check is sketched below.
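As a concrete illustration, here is a minimal sketch (assuming Gaussian predictive distributions with hypothetical means and standard deviations, not a model from the lecture) of computing PIT values and checking calibration, dispersion, and sharpness:

```python
# Minimal sketch of PIT-based checks, assuming each forecast is a
# Gaussian with (hypothetical) mean mu[i] and standard deviation sigma[i].
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 1000
mu = rng.normal(size=n)            # hypothetical forecast means
sigma = np.full(n, 1.0)            # hypothetical forecast std. devs.
y = rng.normal(mu, 1.0)            # observations (here: a well-calibrated case)

# PIT values: Z_F = F(y)
pit = stats.norm.cdf(y, loc=mu, scale=sigma)

# Calibration: PIT values should look Uniform(0, 1) (e.g. via a histogram
# or a Kolmogorov-Smirnov test against the uniform distribution).
ks = stats.kstest(pit, "uniform")

# Dispersion: Var(Z_F) should be close to 1/12 ≈ 0.083.
pit_var = pit.var()

# Sharpness: average width of the central 90% prediction interval.
width90 = np.mean(stats.norm.ppf(0.95, mu, sigma) - stats.norm.ppf(0.05, mu, sigma))

print(f"KS p-value: {ks.pvalue:.2f}, Var(PIT): {pit_var:.3f}, mean 90% width: {width90:.2f}")
```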
Scoring Rules
A scoring rule \(S(F, y)\) measures the “loss” of a predicted probability distribution \(F\) once an observation \(y\) is obtained.
Proper scoring rules are minimized in expectation when the forecast distribution matches the distribution of the observations: \[\mathbb{E}_{y \sim G}\left[S(G, y)\right] \leq \mathbb{E}_{y \sim G}\left[S(F, y)\right] \qquad \forall F.\]
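For example, the logarithmic score \(S(F, y) = -\log f(y)\) is a proper scoring rule. A minimal simulation sketch (the Gaussian forecasts below are assumptions for illustration) shows that its expected value is smallest when the forecast matches the data-generating distribution:

```python
# Minimal sketch: the expected logarithmic score S(F, y) = -log f(y) is
# minimized when the forecast matches the data-generating distribution.
# The candidate forecast standard deviations below are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
y = rng.normal(0.0, 1.0, size=50_000)    # "true" distribution: N(0, 1)

for s in (0.5, 1.0, 2.0):                # candidate forecasts N(0, s^2)
    log_score = -stats.norm.logpdf(y, loc=0.0, scale=s).mean()
    print(f"forecast N(0, {s}^2): mean log score = {log_score:.3f}")
# The s = 1.0 forecast (the true distribution) attains the lowest score.
```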
What if we repeated this procedure for multiple held-out sets?
Randomly split data into \(k = n / m\) equally-sized subsets.
For each \(i = 1, \ldots, k\), fit model to \(y_{-i}\) and test on \(y_i\).
If the data set is large, this gives a good approximation of out-of-sample predictive performance; a generic sketch of this loop follows below.
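A minimal sketch of this loop, where `fit` and `log_score` stand in for whatever model-fitting and scoring functions are being used (they are hypothetical placeholders, not defined in the lecture):

```python
# Minimal k-fold cross-validation sketch. `fit` and `log_score` are
# hypothetical placeholders for the model-fitting and scoring steps.
import numpy as np

def k_fold_score(y, k, fit, log_score, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))                 # random split into k subsets
    folds = np.array_split(idx, k)
    scores = []
    for test_idx in folds:
        train_idx = np.setdiff1d(idx, test_idx)   # y_{-i}
        model = fit(y[train_idx])                 # fit to the training subset
        scores.append(log_score(model, y[test_idx]))  # score on held-out y_i
    return np.mean(scores)
```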
LOO-CV Algorithm
Drop one value \(y_i\).
Refit model on rest of data \(y_{-i}\).
Predict dropped point \(p(\hat{y}_i | y_{-i})\).
Evaluate score on dropped point (\(\log p(y_i | y_{-i})\)).
Repeat for every data point in the data set; a minimal worked example follows below.
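Setting \(k = n\) in the previous loop gives LOO-CV. As a self-contained sketch (using an illustrative normal model with known variance and a flat prior on the mean, not the lecture's model), the dropped-point predictive density \(p(y_i \mid y_{-i})\) is available in closed form:

```python
# Minimal LOO-CV sketch for a normal model with known sigma and a flat
# prior on the mean (an illustrative assumption, not the lecture's model).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
sigma = 1.0
y = rng.normal(2.0, sigma, size=100)

loo_lpd = []
for i in range(len(y)):
    y_rest = np.delete(y, i)                       # drop y_i, keep y_{-i}
    m = len(y_rest)
    mu_post = y_rest.mean()                        # posterior mean of mu
    pred_sd = sigma * np.sqrt(1.0 + 1.0 / m)       # posterior predictive sd
    loo_lpd.append(stats.norm.logpdf(y[i], mu_post, pred_sd))  # log p(y_i | y_{-i})

elpd_loo = np.sum(loo_lpd)     # estimated expected log predictive density
print(f"LOO elpd: {elpd_loo:.2f}")
```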
Information and Uncertainty
Interpreting Scores
When directly comparing models, this can be straightforward: a lower score (usually) means a better model.
But this doesn’t tell us anything about whether a particular score is good or even acceptable: How do we quantify the “distance” from “perfect” prediction?
Uncertainty and Information
More uncertainty ⇒ predictions are more difficult.
One approach: quantify information as the reduction in uncertainty conditional on a prediction or projection.
Example: Perfect prediction ⇒ complete reduction in uncertainty (observation will always match prediction).
Quantifying Uncertainty
What properties should a measure of uncertainty possess?
Should be continuous wrt probabilities;
Should increase with number of possible events;
Should be additive.
Entropy
It turns out there is only one function which satisfies these conditions: Information Entropy (Shannon, 1948): \[H(p) = -\sum_i p_i \log p_i.\]
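A quick numerical check of this definition (the probability vectors below are hypothetical examples):

```python
# Minimal sketch: information entropy H(p) = -sum_i p_i * log(p_i).
# The example probabilities are hypothetical.
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                  # 0 * log(0) is taken as 0
    return -np.sum(p * np.log(p))

print(entropy([0.5, 0.5]))        # ≈ 0.693 (maximal uncertainty for two events)
print(entropy([0.9, 0.1]))        # ≈ 0.325 (less uncertain)
print(entropy([1.0, 0.0]))        # 0.0 (no uncertainty)
```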
Thus the “divergence” (intuitively: the distance) between two distributions is the average difference in log-probabilities between the target \(p\) and the model \(q\): \[D_{KL}(p \parallel q) = \sum_i p_i \left(\log p_i - \log q_i\right).\]
K-L Divergence Example
Suppose the “true” probability of rain is \(p(\text{Rain}) = 0.65\) and the “true” probability of sunshine is \(p(\text{Sunshine}) = 0.35\).
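The sketch below applies the formula above to this example; the candidate forecasts \(q\) are hypothetical, since only the “true” probabilities are given:

```python
# Minimal sketch: K-L divergence from the "true" distribution p to two
# hypothetical forecasts q (the forecast probabilities are illustrative).
import numpy as np

def kl_divergence(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(p * (np.log(p) - np.log(q)))

p = [0.65, 0.35]                  # "true" P(Rain), P(Sunshine)
q1 = [0.60, 0.40]                 # hypothetical forecast close to p
q2 = [0.30, 0.70]                 # hypothetical forecast far from p

print(kl_divergence(p, q1))       # small divergence (better forecast)
print(kl_divergence(p, q2))       # larger divergence (worse forecast)
```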
Figure 4: (a) Simulation of in- vs. out-of-sample deviance calculations with an increasing number of simulations.
Key Points and Upcoming Schedule
Key Points
Information entropy as measure of uncertainty.
Kullback-Leibler Divergence: a measure of “distance” between two distributions.
Difference between K-L divergences lets us compare predictive skill of two models even without knowing “true” probabilities of events.
Deviance: Multiply log-predictive score by -2.
Next Classes
Next Week: Spring Break
Week After: Information Criteria for Model Comparison
Rest of Semester: Specific topics useful for environmental data analysis (extremes, missing data, etc.).
Assessments
HW4: Due on 4/11 at 9pm.
References
Gneiting, T., Balabdaoui, F., & Raftery, A. E. (2007). Probabilistic Forecasts, Calibration and Sharpness. J. R. Stat. Soc. Series B Stat. Methodol., 69, 243–268. Retrieved from http://www.jstor.org/stable/4623266