Correlated Discrepancies


Lecture 07

February 12, 2025

Review

Planetary Energy Balance

Representation of Planetary Energy Balance

Source: Reprinted from A Climate Modeling Primer, A. Henderson-Sellers and K. McGuffie, Wiley, pg. 58, (1987) via https://www.e-education.psu.edu/meteo469/node/137.

The Energy Balance Model (EBM)

\[\begin{align*} \overbrace{\frac{dH}{dt}}^{\text{change in heat}} &= \overbrace{F}^{\text{RF}} - \overbrace{\lambda T}^{\substack{\text{change in} \\ \text{temperature}}} \\[1em] \Rightarrow C\frac{dT}{dt} &= F - \lambda T - \gamma(T-T_D)\\ C_D\frac{dT_D}{dt} &= \gamma(T-T_D) \end{align*}\]

Two Layer EBM Schematic

Source: Palmer et al. (2018)

EBM Discretization

Use Euler discretization:

\[\begin{align*} T(t+1) &= T(t) + \frac{F(t) - \lambda T(t) - \gamma(T(t) - T_D(t))}{C} \Delta t \\[0.5em] T_D(t+1) &= T_D(t) + \frac{\gamma (T(t) - T_D(t))}{C_D} \Delta t \end{align*}\]
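A minimal sketch of this Euler step in Julia (all parameter values below are illustrative, not the fitted values used later; the lecture's own `ebm_wrap` wrapper is not reproduced here):

# Euler step for the two-layer EBM (illustrative parameter values)
function ebm_euler(F; λ=1.3, γ=0.7, C=8.0, C_D=100.0, Δt=1.0)
    n = length(F)
    T = zeros(n)      # surface temperature anomaly
    T_D = zeros(n)    # deep-ocean temperature anomaly
    for t in 1:(n-1)
        T[t+1] = T[t] + (F[t] - λ * T[t] - γ * (T[t] - T_D[t])) / C * Δt
        T_D[t+1] = T_D[t] + γ * (T[t] - T_D[t]) / C_D * Δt
    end
    return T, T_D
end

# example: response to an abrupt, sustained 4 W/m² forcing
T, T_D = ebm_euler(fill(4.0, 200))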

Equilibrium Climate Sensitivity (ECS)

Under steady-state conditions (constant \(F\) and \(dT/dt = 0\)), \[T = \frac{F}{\lambda}.\]

When atmospheric CO2 is doubled, the resulting equilibrium temperature change \(S\) is called the equilibrium climate sensitivity:

\[S = \underbrace{F_{2\times \text{CO}_2}}_{\approx 4 \text{W/m}^2}/\lambda\]
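As a quick back-of-the-envelope check (the feedback value below is assumed for illustration, not a fitted estimate):

λ = 1.2            # feedback parameter (W/m²/K), illustrative value
S = 4.0 / λ        # ≈ 3.3 °C equilibrium warming per CO₂ doubling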

Probability Models for Simulation Models

Most common setting (e.g. Brynjarsdóttir & O’Hagan (2014)):

\[\mathbf{y} = \underbrace{\eta(\mathbf{x}; \theta)}_{\text{model}} + \underbrace{\delta(\mathbf{x})}_{\text{discrepancy}} + \underbrace{\varepsilon}_{\text{error}}\]
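A toy generative sketch of this setup, with a hypothetical one-dimensional "simulator" η, discrepancy δ, and observation error ε (none of these are the EBM quantities used later):

using Distributions

η(x, θ) = θ * x                  # stand-in simulation model
δ(x) = 0.5 * sin(x)              # stand-in discrepancy the model misses
x = range(0, 10; length=50)
θ_true = 2.0
y = η.(x, θ_true) .+ δ.(x) .+ rand(Normal(0, 0.25), length(x))   # y = model + discrepancy + error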

Model-Data Discrepancy

For example, \(\delta\) might capture:

  • Bias (e.g.: model consistently over/underpredicts);
  • Accumulations of error (e.g.: persistent model underestimates);
  • Partial observations (e.g.: do you count every animal?)

Correlated Discrepancies

MLE for Gaussian Discrepancy

Parameter    MLE
S            3.4
γ            1.5
α            1.0
d            77.8
D            623.7
T₀           -0.1
σ            0.1
Code
n_samples = 10_000
temp_iid = ebm_wrap(θ_iid[1:end-1]) # simulate IID best fit (drop σ, which is not a model parameter)
# simulate projections with discrepancy and errors
temp_iid_proj = zeros(n_samples, length(temp_sd))
for i = 1:length(temp_sd)
    temp_iid_err = rand(Normal(0, sqrt.(θ_iid[end]^2 .+ temp_sd[i]^2)), n_samples)
    temp_iid_proj[:, i] = temp_iid[i] .+ temp_iid_err
end
# calculate quantiles
temp_iid_q = mapslices(col -> quantile(col, [0.05, 0.5, 0.95]), temp_iid_proj; dims=1)

p = scatter(time_obs, temp_obs, color=:black, label="Observations", ylabel="(°C)", xlabel="Year", title="Temperature Anomaly")
plot!(p, time_obs, temp_iid_q[2, :], ribbon=(temp_iid_q[2, :] - temp_iid_q[1, :], temp_iid_q[3, :] - temp_iid_q[2, :]), color=:red, fillalpha=0.3, label="Gaussian Discrepancy")
plot!(size=(600, 500))
Figure 1: MLE Fit for Gaussian discrepancy

Analyzing Residual Assumptions

Code
resids_homogauss = temp_obs - temp_iid

p1 = qqnorm(resids_homogauss, xlabel="Theoretical Values", ylabel="Empirical Values", title="Normal Q-Q Plot", size=(600, 500))
pacf_homogauss = pacf(resids_homogauss, 1:5)
p2 = plot(1:5, pacf_homogauss, marker=:circle, line=:stem, linewidth=3, markersize=8, tickfontsize=16, guidefontsize=18, legend=false, ylabel="Partial Autocorrelation", xlabel="Time Lag", title="Partial Autocorrelation Plot", size=(600, 500))
hline!(p2, [0], color=:black, linestyle=:dash)

display(p1)
display(p2)
(a) Normal Q-Q plot of residuals
(b) Partial autocorrelation plot of residuals
Figure 2: Residual diagnostics

AR(1) Residuals

Autocorrelated

\[ \begin{align*} y_t &= \text{EBM}(\theta; F_t) + \delta_t + \varepsilon_t \\ \delta_t &= \rho \delta_{t-1} + \omega_t \\ \varepsilon_t &\sim N(0, \sigma^2_{\text{obs}, t}) \\ \omega_t &\sim N(0, \sigma^2) \end{align*} \]

Independent

\[ \begin{align*} y_t &= \text{EBM}(\theta; F_t) + \delta_t + \varepsilon_t \\ \delta_t &\sim N(0, \sigma^2) \\ \varepsilon_t &\sim N(0, \sigma^2_{\text{obs}, t}) \end{align*} \]
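A short sketch contrasting the two discrepancy processes (the values of ρ and σ below are illustrative, not the MLEs):

using Distributions

ρ, σ = 0.5, 0.1
T = 170
δ_ar = zeros(T)
δ_ar[1] = rand(Normal(0, σ / sqrt(1 - ρ^2)))        # stationary initial draw
for t in 2:T
    δ_ar[t] = ρ * δ_ar[t-1] + rand(Normal(0, σ))    # AR(1) discrepancy
end
δ_iid = rand(Normal(0, σ), T)                        # independent discrepancy for comparison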

Likelihood Function

Without observation errors (\(\varepsilon\)), can whiten residuals:

\[ \begin{align*} y_1 & \sim N\left(0, \frac{\sigma^2}{1 - \rho^2}\right) \\ y_t - \rho y_{t-1} &\sim N(0, \sigma^2) \end{align*} \]

When the observation error variances are given in the data (as they are here), this approach still works.

Non-Identifiability of Parameters

But when the observation error variance is also uncertain, we run into non-identifiability:

\[N(0, \sigma^2) + N(0, \sigma_\text{obs}^2) = N(0, {\color{red}\sigma^2 + \sigma_\text{obs}^2})\]
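A small numerical illustration (made-up values): any split of the same total variance gives the same likelihood, so σ and σ_obs cannot be separated from iid residuals alone.

using Distributions

resid = rand(Normal(0, sqrt(0.1^2 + 0.05^2)), 100)
ll(σ, σ_obs) = sum(logpdf.(Normal(0, sqrt(σ^2 + σ_obs^2)), resid))
ll(0.1, 0.05) ≈ ll(0.05, 0.1)    # true: only σ² + σ_obs² matters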

Joint Likelihood for AR(1)

Could also use joint likelihood for residuals \(y_t - \text{EBM}(\theta; F_t)\):

\[ \begin{align*} \mathbf{y} &\sim \mathcal{N}(\mathbf{0}, \Sigma) \\ \Sigma &= \frac{\sigma^2}{1 - \rho^2} \begin{pmatrix}1 & \rho & \ldots & \rho^{T-1} \\ \rho & 1 & \ldots & \rho^{T-2} \\ \vdots & \vdots & \ddots & \vdots \\ \rho^{T-1} & \rho^{T-2} & \ldots & 1\end{pmatrix} + \begin{pmatrix}\sigma_{\text{obs}, 1}^2 & & \\ & \ddots & \\ & & \sigma_{\text{obs}, T}^2\end{pmatrix} \end{align*} \]
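A sketch of this joint likelihood using Distributions.jl (the function name and interface are illustrative; it assumes the residuals and the per-observation error standard deviations are passed in):

using Distributions, LinearAlgebra

# joint Gaussian log-likelihood of residuals under an AR(1) discrepancy
# plus independent observation errors with known standard deviations σ_obs
function ar1_joint_loglik(resid, σ, ρ, σ_obs)
    T = length(resid)
    Σ = [σ^2 / (1 - ρ^2) * ρ^abs(i - j) for i in 1:T, j in 1:T]   # AR(1) covariance
    Σ += Diagonal(σ_obs .^ 2)                                      # add observation-error variances
    return logpdf(MvNormal(zeros(T), Σ), resid)
end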

Whitened Likelihood for AR(1) (in code)

# temp_obs: temperature data
# temp_err: standard deviations for temperatures
# m: model function
function ar_loglik(params, temp_obs, temp_err, m)
    S, γ, α, d, D, T₀, ρ, σ = params 
    ebm_sim = m((S, γ, α, d, D, T₀))
    residuals = temp_obs - ebm_sim
    # whiten residuals
    ll = 0
    # notice addition of observation errors
    for t in eachindex(temp_obs)
        if t == 1
            # first residual: stationary AR(1) variance plus observation error variance
            ll += logpdf(Normal(0, sqrt(σ^2 / (1 - ρ^2) + temp_err[1]^2)), residuals[1])
        else
            resid_wn = residuals[t] - ρ * residuals[t-1]
            ll += logpdf(Normal(0, sqrt(σ^2 + temp_err[t]^2)), resid_wn)
        end
    end
    return ll
end

AR(1) MLE (Code)

lower = [1.0, 0.5, 0.0, 50.0, 200.0, temp_lo[1], -1.0, 0.0]
upper = [5.0, 1.5, 2.0, 200.0, 1000.0, temp_hi[1], 1.0, 10.0]
p0 = [3.0, 1.0, 1.0, 100.0, 800.0, temp_obs[1], 0.0, 5.0]

result = Optim.optimize(params -> -ar_loglik(params, temp_obs, temp_sd, ebm_wrap), lower, upper, p0)
θ_ar = result.minimizer

Comparison of MLE

Parameter    IID      AR
S            3.4      3.3
γ            1.5      1.5
α            1.0      0.9
d            77.8     88.6
D            623.7    649.1
T₀           -0.1     -0.1
ρ            -        0.5
σ            0.1      0.1

AR(1) Hindcast

Code
n = 10_000

# get model hindcasts
temp_iid = ebm_wrap(θ_iid[1:end-1])
temp_ar = ebm_wrap(θ_ar[1:end-2])

# get iid and AR residuals from relevant processes
residuals_iid = stack(rand.(Normal.(0, sqrt.(temp_sd.^2 .+ θ_iid[end].^2)), n), dims=1)
residuals_ar = zeros(length(hind_idx), n)
for t = 1:length(temp_sd)
    if t == 1
        residuals_ar[t, :] = rand(Normal(0, sqrt(θ_ar[end]^2 / (1 - θ_ar[end-1]^2) + temp_sd[1]^2)), n)
    else
        residuals_ar[t, :] = θ_ar[end-1] * residuals_ar[t-1, :] + rand(Normal(0, sqrt(θ_ar[end]^2 + temp_sd[t]^2)), n)
    end
end

# add residuals back to model simulations
model_sim_iid = (residuals_iid .+ temp_iid)'
model_sim_ar = (residuals_ar .+ temp_ar)'

# get quantiles
q90_iid = mapslices(col -> quantile(col, [0.05, 0.5, 0.95]), model_sim_iid; dims=1) # compute 90% prediction interval
q90_ar = mapslices(col -> quantile(col, [0.05, 0.5, 0.95]), model_sim_ar; dims=1) # compute 90% prediction interval

p = scatter(time_obs, temp_obs, yerr=(temp_obs - temp_lo, temp_hi - temp_obs), color=:black, label="Observations", ylabel="(°C)", xlabel="Year", title="Temperature Anomaly", markersize=5)
plot!(p, hind_years, q90_iid[2, :], ribbon=(q90_iid[2, :] - q90_iid[1, :], q90_iid[3, :] - q90_iid[2, :]), fillalpha=0.2, linewidth=3, label="IID")
plot!(p, hind_years, q90_ar[2, :], ribbon=(q90_ar[2, :] - q90_ar[1, :], q90_ar[3, :] - q90_ar[2, :]), fillalpha=0.2, linewidth=3, label="AR")
plot!(size=(700, 550))
Figure 3: Hindcast of the EBM with IID and AR(1) residuals.

Coverage Rates (computed as in the sketch below):

  • IID: 90.2%
  • AR(1): 93.1%
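The coverage rates can be computed by counting observations that fall inside the 90% intervals, reusing q90_iid, q90_ar, and temp_obs from the hindcast code above:

using Statistics

coverage(q, y) = mean(q[1, :] .<= y .<= q[3, :])   # fraction of observations inside the interval
coverage(q90_iid, temp_obs), coverage(q90_ar, temp_obs)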

Have We Captured Residual Autocorrelation?

Use simulations from the fitted model to check whether the assumed residual structure (e.g. the partial autocorrelations) is reproduced.

Code
resids_ar_sim = model_sim_ar .- temp_obs'
boxplot(mapslices(col -> pacf(col, 1:5), resids_ar_sim; dims=1)', label=:false, xlabel="Lag", ylabel="Partial Autocorrelation")
plot!(size=(500, 500))
Figure 4: Distribution of residual partial autocorrelations

Hindcasts May Not Differ Much…

Figure 5: Hindcast of the EBM with IID and AR(1) residuals.

…But Projections Can

Code
ebm_sim_85(params) = ebm(forcing_non_aerosol_85[sim_idx], forcing_aerosol_85[sim_idx], p = params)
ebm_sim_26(params) = ebm(forcing_non_aerosol_26[sim_idx], forcing_aerosol_26[sim_idx], p = params)

# iid residuals
y_err = zeros(length(sim_idx))
y_err[1:length(hind_idx)] = temp_sd

residuals_iid = stack(rand.(Normal.(0, sqrt.(y_err.^2 .+ θ_iid[end]^2)), n), dims=1)
model_iid_85 = ebm_sim_85(θ_iid[1:end-1])
model_iid_26 = ebm_sim_26(θ_iid[1:end-1])
model_sim_iid_85 = (residuals_iid .+ model_iid_85)'
model_sim_iid_26 = (residuals_iid .+ model_iid_26)'
q90_iid_85 = mapslices(col -> quantile(col, [0.05, 0.5, 0.95]), model_sim_iid_85; dims=1) # compute 90% prediction interval
q90_iid_26 = mapslices(col -> quantile(col, [0.05, 0.5, 0.95]), model_sim_iid_26; dims=1) # compute 90% prediction interval

# AR residuals
residuals_ar = zeros(length(sim_idx), n)
for t = 1:length(sim_idx)
    if t == 1
        residuals_ar[t, :] = rand(Normal(0, sqrt(θ_ar[end]^2 / (1 - θ_ar[end-1]^2) + temp_sd[1]^2)), n)
    elseif t <= length(hind_idx)
        residuals_ar[t, :] = θ_ar[end-1] * residuals_ar[t-1, :] + rand(Normal(0, sqrt(θ_ar[end]^2 + temp_sd[t]^2)), n)
    else
        residuals_ar[t, :] = θ_ar[end-1] * residuals_ar[t-1, :] + rand(Normal(0, θ_ar[end]), n)
    end
end
model_ar_85 = ebm_sim_85(θ_ar[1:end-2])
model_ar_26 = ebm_sim_26(θ_ar[1:end-2])
model_sim_ar_26 = (residuals_ar .+ model_ar_26)'
model_sim_ar_85 = (residuals_ar .+ model_ar_85)'
q90_ar_26 = mapslices(col -> quantile(col, [0.05, 0.5, 0.95]), model_sim_ar_26; dims=1) # compute 90% prediction interval
q90_ar_85 = mapslices(col -> quantile(col, [0.05, 0.5, 0.95]), model_sim_ar_85; dims=1) # compute 90% prediction interval

p_sim = scatter(time_obs, temp_obs, color=:black, label="Data", ylabel="Temperature Anomaly (°C)", xlabel="Year", right_margin=5mm)
plot!(p_sim, sim_years, q90_iid_26[2, :], ribbon=(q90_iid_26[2, :] - q90_iid_26[1, :], q90_iid_26[3, :] - q90_iid_26[2, :]), fillalpha=0.2, linewidth=3, color=:royalblue, label="IID/SSP1-2.6")
plot!(p_sim, sim_years, q90_ar_26[2, :], ribbon=(q90_ar_26[2, :] - q90_ar_26[1, :], q90_ar_26[3, :] - q90_ar_26[2, :]), fillalpha=0.2, linewidth=3, color=:firebrick1, label="AR/SSP1-2.6")
plot!(p_sim, sim_years, q90_iid_85[2, :], ribbon=(q90_iid_85[2, :] - q90_iid_85[1, :], q90_iid_85[3, :] - q90_iid_85[2, :]), fillalpha=0.2, linewidth=3, color=:blue3, label="IID/SSP5-8.5")
plot!(p_sim, sim_years, q90_ar_85[2, :], ribbon=(q90_ar_85[2, :] - q90_ar_85[1, :], q90_ar_85[3, :] - q90_ar_85[2, :]), fillalpha=0.2, linewidth=3, color=:firebrick, label="AR/SSP5-8.5")
plot!(p_sim, size=(1100, 550))
xlims!((2000, 2100))
ylims!((0.75, 5.5))
Figure 6: Differences in the 90% confidence intervals

These Differences Could Be Decision-Relevant

Code
p1 = histogram(model_sim_iid_26[:, end], color=:blue, xlabel="°C", ylabel="Count", label="IID", size=(600, 450), alpha=0.4, title="SSP1-2.6")
histogram!(p1, model_sim_ar_26[:, end], color=:red, label="AR", alpha=0.4)
p2 = histogram(model_sim_iid_85[:, end], color=:blue, xlabel="°C", ylabel="Count", label="IID", size=(600, 450), alpha=0.4, title="SSP5-8.5")
histogram!(p2, model_sim_ar_85[:, end], color=:red, label="AR", alpha=0.4)

display(p1)
display(p2)
(a) SSP1-2.6
(b) SSP5-8.5
Figure 7: Projections of global mean temperature in 2100

More General Models

Lynx and Hare Pelts

Code
lh_obs = DataFrame(CSV.File("data/ecology/Lynx_Hare.txt", header=[:Year, :Hare, :Lynx]))[:, 1:3]
plot(lh_obs[!, :Year], lh_obs[!, :Lynx], xlabel="Year", ylabel="Pelts (thousands)", markersize=5, markershape=:circle, markercolor=:red, color=:red, linewidth=3, label="Lynx")
plot!(lh_obs[!, :Year], lh_obs[!, :Hare], markersize=5, markershape=:circle, markercolor=:blue, color=:blue, linewidth=3, label="Hare")
plot!(size=(1100, 500))
Figure 8: Lynx and Hare pelt dataset

Predator-Prey Dynamics

\[ \begin{align*} \frac{dH}{dt} &= H_t \underbrace{b_H}_{\substack{\text{birth} \\ \text{rate}}} - H_t (\underbrace{L_t m_H}_{\substack{\text{impact of} \\ \text{lynxes}}}) \\ \frac{dL}{dt} &= L_t (H_t b_L) - L_t m_L \end{align*} \]

Can The Model Replicate The Patterns?

Code
# specifying the ODE problem using DifferentialEquations.jl
function lynx_hare!(dP, P, θ, t)
    H, L = P
    bh, mh, bl, ml = θ
    dP[1] = (bh - L * mh) * H
    dP[2] = (bl * H - ml) * L
end

# run a simulation based on the lynx_hare! solution
function lh_sim(params, N)
    H₁, L₁ = params[end-1:end]
    prob = ODEProblem(lynx_hare!, [H₁, L₁], (0.0, float(N - 1)), params[1:end-2])
    sol = solve(prob, saveat=1)
    H = map(first, sol.u[1:N])
    L = map(last, sol.u[1:N])
    return (H, L)
end

params = (0.54, 0.005, 0.005, 0.8, 190, 30)
H, L = lh_sim(params, nrow(lh_obs))
plot(lh_obs[!, :Year], H, label="Hare", linewidth=3, xlabel="Year", ylabel="Population (thousands)", title="Simulated Population Dynamics", color=:blue)
plot!(lh_obs[!, :Year], L, label="Lynx", linewidth=3, color=:red)
plot!(size=(1100, 450))
Figure 9: Synthetic data from predator-prey model

What Is A Generative Process For Pelts?

  • Initial population changes according to predator-prey model;
  • Some fraction of population are trapped;
  • Trap rates differ by species and can vary by year.

Predator-Prey Probability Model

\[\underbrace{h_t}_{\substack{\text{hare} \\ \text{pelts}}} \sim \text{LogNormal}(\log(\underbrace{p_H}_{\substack{\text{trap} \\ \text{rate}}} H_T), \sigma_H)\] \[l_t \sim \text{LogNormal}(\log(p_L L_T), \sigma_L)\]

\[ \begin{align*} \frac{dH}{dt} &= H_t b_H - H_t (L_t m_H) \\ H_T &= H_1 + \int_1^T \frac{dH}{dt}dt \end{align*} \]

\[ \begin{align*} \frac{dL}{dt} &= L_t (H_t b_L) - L_t m_L \\ L_T &= L_1 + \int_1^T \frac{dL}{dt}dt \end{align*} \]
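A sketch of this generative process in code, reusing lh_sim and params from above; the trap rates and noise scales below are hypothetical, not fitted values:

using Distributions

H, L = lh_sim(params, nrow(lh_obs))
p_H, p_L = 0.1, 0.2          # hypothetical trap rates
σ_H, σ_L = 0.25, 0.25        # hypothetical observation noise (log scale)
hare_pelts = rand.(LogNormal.(log.(p_H .* H), σ_H))   # simulated hare pelt counts
lynx_pelts = rand.(LogNormal.(log.(p_L .* L), σ_L))   # simulated lynx pelt counts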

Key Points

Key Points

  • Think generatively about probability models for calibrating models:
      • Discrepancy: corrects for mismatches between the model output and the “state” of the system;
      • Observation errors: probability distribution for the observations given the discrepancy adjustment.
  • The choice of probability model (including the discrepancy) can impact projections even if the hindcasts (“validation”) do not appear very different.

Discussion of Shmueli (2010)

Questions To Seed Discussion

  • What do you think are the differences between predictive and explanatory modeling?
  • What can go wrong when we conflate the two?
  • Can you think of approaches or workflows which bridge the two paradigms?

Upcoming Schedule

Next Classes

Monday: Feb Break!

Wednesday: Bayesian Statistics

Assessments

Homework 2 available; due next Friday (2/21).

No quiz or reading this week!

References

References (Scroll for Full List)

Brynjarsdóttir, J., & O’Hagan, A. (2014). Learning about physical parameters: the importance of model discrepancy. Inverse Problems, 30, 114007. https://doi.org/10.1088/0266-5611/30/11/114007
Palmer, M. D., Harris, G. R., & Gregory, J. M. (2018). Extending CMIP5 projections of global mean temperature change and sea level rise due to thermal expansion using a physically-based emulator. Environ. Res. Lett., 13, 084003. https://doi.org/10.1088/1748-9326/aad2e4