Correlated Discrepancies


Lecture 07

February 12, 2025

Review

Planetary Energy Balance

Representation of Planetary Energy Balance

Source: Reprinted from A Climate Modeling Primer, A. Henderson-Sellers and K. McGuffie, Wiley, pg. 58, (1987) via https://www.e-education.psu.edu/meteo469/node/137.

The Energy Balance Model (EBM)

\[\begin{align*} \overbrace{\frac{dH}{dt}}^{\text{change in heat}} &= \overbrace{F}^{\text{RF}} - \overbrace{\lambda T}^{\substack{\text{change in} \\ \text{temperature}}} \\[1em] \Rightarrow C\frac{dT}{dt} &= F - \lambda T - \gamma(T-T_D)\\ C_D\frac{dT_D}{dt} &= \gamma(T-T_D) \end{align*}\]

Two Layer EBM Schematic

Source: Palmer et al. (2018)

EBM Discretization

Use Euler discretization:

\[\begin{align*} T(t+1) &= T(t) + \frac{F(t) - \lambda T(t) - \gamma(T(t) - T_D(t))}{C} \Delta t \\[0.5em] T_D(t+1) &= T_D(t) + \frac{\gamma (T(t) - T_D(t))}{C_D} \Delta t \end{align*}\]
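A minimal sketch of this Euler step in Julia (all parameter values below are illustrative, not the fitted values used later; the lecture's own `ebm_wrap` wrapper is not reproduced here):

# Euler step for the two-layer EBM (illustrative parameter values)
function ebm_euler(F; λ=1.3, γ=0.7, C=8.0, C_D=100.0, Δt=1.0)
    n = length(F)
    T = zeros(n)      # surface temperature anomaly
    T_D = zeros(n)    # deep-ocean temperature anomaly
    for t in 1:(n-1)
        T[t+1] = T[t] + (F[t] - λ * T[t] - γ * (T[t] - T_D[t])) / C * Δt
        T_D[t+1] = T_D[t] + γ * (T[t] - T_D[t]) / C_D * Δt
    end
    return T, T_D
end

# example: response to an abrupt, sustained 4 W/m² forcing
T, T_D = ebm_euler(fill(4.0, 200))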

Equilibrium Climate Sensitivity (ECS)

Under steady-state conditions (constant \(F\) and \(dT/dt = 0\)), \[T = \frac{F}{\lambda}.\]

When atmospheric CO2 is doubled, the resulting equilibrium temperature change \(S\) is called the equilibrium climate sensitivity:

\[S = \underbrace{F_{2\times \text{CO}_2}}_{\approx 4 \text{W/m}^2}/\lambda\]
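As a quick back-of-the-envelope check (the feedback value below is assumed for illustration, not a fitted estimate):

λ = 1.2            # feedback parameter (W/m²/K), illustrative value
S = 4.0 / λ        # ≈ 3.3 °C equilibrium warming per CO₂ doubling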

Probability Models for Simulation Models

Most common setting (e.g. Brynjarsdóttir & O’Hagan (2014)):

\[\mathbf{y} = \underbrace{\eta(\mathbf{x}; \theta)}_{\text{model}} + \underbrace{\delta(\mathbf{x})}_{\text{discrepancy}} + \underbrace{\varepsilon}_{\text{error}}\]
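A toy generative sketch of this setup, with a hypothetical one-dimensional "simulator" η, discrepancy δ, and observation error ε (none of these are the EBM quantities used later):

using Distributions

η(x, θ) = θ * x                  # stand-in simulation model
δ(x) = 0.5 * sin(x)              # stand-in discrepancy the model misses
x = range(0, 10; length=50)
θ_true = 2.0
y = η.(x, θ_true) .+ δ.(x) .+ rand(Normal(0, 0.25), length(x))   # y = model + discrepancy + error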

Model-Data Discrepancy

For example, \(\delta\) might capture:

  • Bias (e.g.: model consistently over/underpredicts);
  • Accumulations of error (e.g.: persistent model underestimates);
  • Partial observations (e.g.: do you count every animal?)

Correlated Discrepancies

MLE for Gaussian Discrepancy

Parameter    MLE
S            3.4
γ            1.5
α            1.0
d            77.8
D            623.7
T₀           -0.1
σ            0.1
Code
n_samples = 10_000
temp_iid = ebm_wrap(θ_iid[1:end-1]) # simulate IID best fit (drop σ, which is not a model parameter)
# simulate projections with discrepancy and errors
temp_iid_proj = zeros(n_samples, length(temp_sd))
for i = 1:length(temp_sd)
    temp_iid_err = rand(Normal(0, sqrt.(θ_iid[end]^2 .+ temp_sd[i]^2)), n_samples)
    temp_iid_proj[:, i] = temp_iid[i] .+ temp_iid_err
end
# calculate quantiles
temp_iid_q = mapslices(col -> quantile(col, [0.05, 0.5, 0.95]), temp_iid_proj; dims=1)

p = scatter(time_obs, temp_obs, color=:black, label="Observations", ylabel="(°C)", xlabel="Year", title="Temperature Anomaly")
plot!(p, time_obs, temp_iid_q[2, :], ribbon=(temp_iid_q[2, :] - temp_iid_q[1, :], temp_iid_q[3, :] - temp_iid_q[2, :]), color=:red, fillalpha=0.3, label="Gaussian Discrepancy")
plot!(size=(600, 500))
Figure 1: MLE Fit for Gaussian discrepancy

Analyzing Residual Assumptions

Code
resids_homogauss = temp_obs - temp_iid

p1 = qqnorm(resids_homogauss, xlabel="Theoretical Values", ylabel="Empirical Values", title="Normal Q-Q Plot", size=(600, 500))
pacf_homogauss = pacf(resids_homogauss, 1:5)
p2 = plot(1:5, pacf_homogauss, marker=:circle, line=:stem, linewidth=3, markersize=8, tickfontsize=16, guidefontsize=18, legend=false, ylabel="Partial Autocorrelation", xlabel="Time Lag", title="Partial Autocorrelation Plot", size=(600, 500))
hline!(p2, [0], color=:black, linestyle=:dash)

display(p1)
display(p2)
(a) Normal Q-Q plot of residuals
(b) Partial autocorrelation plot of residuals
Figure 2: Residual diagnostics

AR(1) Residuals

Autocorrelated

\[ \begin{align*} y_t &= \text{EBM}(\theta; F_t) + \delta_t + \varepsilon_t \\ \delta_t &= \rho \delta_{t-1} + \omega_t \\ \varepsilon_t &\sim N(0, \sigma^2_{\text{obs}, t}) \\ \omega_t &\sim N(0, \sigma^2) \end{align*} \]

Independent

\[ \begin{align*} y_t &= \text{EBM}(\theta; F_t) + \delta_t + \varepsilon_t \\ \delta_t &\sim N(0, \sigma^2) \\ \varepsilon_t &\sim N(0, \sigma^2_{\text{obs}, t}) \end{align*} \]
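A short sketch contrasting the two discrepancy processes (the values of ρ and σ below are illustrative, not the MLEs):

using Distributions

ρ, σ = 0.5, 0.1
T = 170
δ_ar = zeros(T)
δ_ar[1] = rand(Normal(0, σ / sqrt(1 - ρ^2)))        # stationary initial draw
for t in 2:T
    δ_ar[t] = ρ * δ_ar[t-1] + rand(Normal(0, σ))    # AR(1) discrepancy
end
δ_iid = rand(Normal(0, σ), T)                        # independent discrepancy for comparison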

Likelihood Function

Without observation errors (\(\varepsilon\)), can whiten residuals:

\[ \begin{align*} y_1 & \sim N\left(0, \frac{\sigma^2}{1 - \rho^2}\right) \\ y_t - \rho y_{t-1} &\sim N(0, \sigma^2) \end{align*} \]

When the observation error variances are given in the data (as they are here), this approach still works.

Non-Identifiability of Parameters

But when the observation error variance is also uncertain, we run into non-identifiability:

\[N(0, \sigma^2) + N(0, \sigma_\text{obs}^2) = N(0, {\color{red}\sigma^2 + \sigma_\text{obs}^2})\]
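A small numerical illustration (made-up values): any split of the same total variance gives the same likelihood, so σ and σ_obs cannot be separated from iid residuals alone.

using Distributions

resid = rand(Normal(0, sqrt(0.1^2 + 0.05^2)), 100)
ll(σ, σ_obs) = sum(logpdf.(Normal(0, sqrt(σ^2 + σ_obs^2)), resid))
ll(0.1, 0.05) ≈ ll(0.05, 0.1)    # true: only σ² + σ_obs² matters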

Joint Likelihood for AR(1)

Could also use joint likelihood for residuals \(y_t - \text{EBM}(\theta; F_t)\):

\[ \begin{align*} \mathbf{y} &\sim \mathcal{N}(\mathbf{0}, \Sigma) \\ \Sigma &= \frac{\sigma^2}{1 - \rho^2} \begin{pmatrix}1 & \rho & \ldots & \rho^{T-1} \\ \rho & 1 & \ldots & \rho^{T-2} \\ \vdots & \vdots & \ddots & \vdots \\ \rho^{T-1} & \rho^{T-2} & \ldots & 1\end{pmatrix} + \begin{pmatrix}\sigma_{\text{obs}, 1}^2 & & \\ & \ddots & \\ & & \sigma_{\text{obs}, T}^2\end{pmatrix} \end{align*} \]
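A sketch of this joint likelihood using Distributions.jl (the function name and interface are illustrative; it assumes the residuals and the per-observation error standard deviations are passed in):

using Distributions, LinearAlgebra

# joint Gaussian log-likelihood of residuals under an AR(1) discrepancy
# plus independent observation errors with known standard deviations σ_obs
function ar1_joint_loglik(resid, σ, ρ, σ_obs)
    T = length(resid)
    Σ = [σ^2 / (1 - ρ^2) * ρ^abs(i - j) for i in 1:T, j in 1:T]   # AR(1) covariance
    Σ += Diagonal(σ_obs .^ 2)                                      # add observation-error variances
    return logpdf(MvNormal(zeros(T), Σ), resid)
end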

Whitened Likelihood for AR(1) (in code)

# temp_obs: temperature data
# temp_err: standard deviations for temperatures
# m: model function
function ar_loglik(params, temp_obs, temp_err, m)
    S, γ, α, d, D, T₀, ρ, σ = params 
    ebm_sim = m((S, γ, α, d, D, T₀))
    residuals = temp_obs - ebm_sim
    # whiten residuals
    ll = 0
    # notice addition of observation errors
    for t in eachindex(temp_obs)
        if t == 1
            # first residual: stationary AR(1) variance plus observation error variance
            ll += logpdf(Normal(0, sqrt(σ^2 / (1 - ρ^2) + temp_err[1]^2)), residuals[1])
        else
            resid_wn = residuals[t] - ρ * residuals[t-1]
            ll += logpdf(Normal(0, sqrt(σ^2 + temp_err[t]^2)), resid_wn)
        end
    end
    return ll
end

AR(1) MLE (Code)

lower = [1.0, 0.5, 0.0, 50.0, 200.0, temp_lo[1], -1.0, 0.0]
upper = [5.0, 1.5, 2.0, 200.0, 1000.0, temp_hi[1], 1.0, 10.0]
p0 = [3.0, 1.0, 1.0, 100.0, 800.0, temp_obs[1], 0.0, 5.0]

result = Optim.optimize(params -> -ar_loglik(params, temp_obs, temp_sd, ebm_wrap), lower, upper, p0)
θ_ar = result.minimizer

Comparison of MLE

Parameter    IID      AR
S            3.4      3.3
γ            1.5      1.5
α            1.0      0.9
d            77.8     88.6
D            623.7    649.1
T₀           -0.1     -0.1
ρ            -        0.5
σ            0.1      0.1

AR(1) Hindcast

Code
n = 10_000

# get model hindcasts
temp_iid = ebm_wrap(θ_iid[1:end-1])
temp_ar = ebm_wrap(θ_ar[1:end-2])

# get iid and AR residuals from relevant processes
residuals_iid = stack(rand.(Normal.(0, sqrt.(temp_sd.^2 .+ θ_iid[end].^2)), n), dims=1)
residuals_ar = zeros(length(hind_idx), n)
for t = 1:length(temp_sd)
    if t == 1
        residuals_ar[t, :] = rand(Normal(0, sqrt(θ_ar[end]^2 / (1 - θ_ar[end-1]^2) + temp_sd[1]^2)), n)
    else
        residuals_ar[t, :] = θ_ar[end-1] * residuals_ar[t-1, :] + rand(Normal(0, sqrt(θ_ar[end]^2 + temp_sd[t]^2)), n)
    end
end

# add residuals back to model simulations
model_sim_iid = (residuals_iid .+ temp_iid)'
model_sim_ar = (residuals_ar .+ temp_ar)'

# get quantiles
q90_iid = mapslices(col -> quantile(col, [0.05, 0.5, 0.95]), model_sim_iid; dims=1) # compute 90% prediction interval
q90_ar = mapslices(col -> quantile(col, [0.05, 0.5, 0.95]), model_sim_ar; dims=1) # compute 90% prediction interval

p = scatter(time_obs, temp_obs, yerr=(temp_obs - temp_lo, temp_hi - temp_obs), color=:black, label="Observations", ylabel="(°C)", xlabel="Year", title="Temperature Anomaly", markersize=5)
plot!(p, hind_years, q90_iid[2, :], ribbon=(q90_iid[2, :] - q90_iid[1, :], q90_iid[3, :] - q90_iid[2, :]), fillalpha=0.2, linewidth=3, label="IID")
plot!(p, hind_years, q90_ar[2, :], ribbon=(q90_ar[2, :] - q90_ar[1, :], q90_ar[3, :] - q90_ar[2, :]), fillalpha=0.2, linewidth=3, label="AR")
plot!(size=(700, 550))
Figure 3: Hindcast of the EBM with IID and AR(1) residuals.

Coverage Rates (computed as in the sketch below):

  • IID: 90.2%
  • AR(1): 93.1%
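The coverage rates can be computed by counting observations that fall inside the 90% intervals, reusing q90_iid, q90_ar, and temp_obs from the hindcast code above:

using Statistics

coverage(q, y) = mean(q[1, :] .<= y .<= q[3, :])   # fraction of observations inside the interval
coverage(q90_iid, temp_obs), coverage(q90_ar, temp_obs)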

Have We Captured Residual Autocorrelation?

Use simulations from the fitted model to check whether the assumed residual structure (e.g. the partial autocorrelations) is reproduced.

Code
resids_ar_sim = model_sim_ar .- temp_obs'
boxplot(mapslices(col -> pacf(col, 1:5), resids_ar_sim; dims=1)', label=:false, xlabel="Lag", ylabel="Partial Autocorrelation")
plot!(size=(500, 500))
Figure 4: Distribution of residual partial autocorrelations

Hindcasts May Not Differ Much…

Figure 5: Hindcast of the EBM with IID and AR(1) residuals.

…But Projections Can

Code
ebm_sim_85(params) = ebm(forcing_non_aerosol_85[sim_idx], forcing_aerosol_85[sim_idx], p = params)
ebm_sim_26(params) = ebm(forcing_non_aerosol_26[sim_idx], forcing_aerosol_26[sim_idx], p = params)

# iid residuals
y_err = zeros(length(sim_idx))
y_err[1:length(hind_idx)] = temp_sd

residuals_iid = stack(rand.(Normal.(0, sqrt.(y_err.^2 .+ θ_iid[end]^2)), n), dims=1)
model_iid_85 = ebm_sim_85(θ_iid[1:end-1])
model_iid_26 = ebm_sim_26(θ_iid[1:end-1])
model_sim_iid_85 = (residuals_iid .+ model_iid_85)'
model_sim_iid_26 = (residuals_iid .+ model_iid_26)'
q90_iid_85 = mapslices(col -> quantile(col, [0.05, 0.5, 0.95]), model_sim_iid_85; dims=1) # compute 90% prediction interval
q90_iid_26 = mapslices(col -> quantile(col, [0.05, 0.5, 0.95]), model_sim_iid_26; dims=1) # compute 90% prediction interval

# AR residuals
residuals_ar = zeros(length(sim_idx), n)
for t = 1:length(sim_idx)
    if t == 1
        residuals_ar[t, :] = rand(Normal(0, sqrt(θ_ar[end]^2 / (1 - θ_ar[end-1]^2) + temp_sd[1]^2)), n)
    elseif t <= length(hind_idx)
        residuals_ar[t, :] = θ_ar[end-1] * residuals_ar[t-1, :] + rand(Normal(0, sqrt(θ_ar[end]^2 + temp_sd[t]^2)), n)
    else
        residuals_ar[t, :] = θ_ar[end-1] * residuals_ar[t-1, :] + rand(Normal(0, θ_ar[end]), n)
    end
end
model_ar_85 = ebm_sim_85(θ_ar[1:end-2])
model_ar_26 = ebm_sim_26(θ_ar[1:end-2])
model_sim_ar_26 = (residuals_ar .+ model_ar_26)'
model_sim_ar_85 = (residuals_ar .+ model_ar_85)'
q90_ar_26 = mapslices(col -> quantile(col, [0.05, 0.5, 0.95]), model_sim_ar_26; dims=1) # compute 90% prediction interval
q90_ar_85 = mapslices(col -> quantile(col, [0.05, 0.5, 0.95]), model_sim_ar_85; dims=1) # compute 90% prediction interval

p_sim = scatter(time_obs, temp_obs, color=:black, label="Data", ylabel="Temperature Anomaly (°C)", xlabel="Year", right_margin=5mm)
plot!(p_sim, sim_years, q90_iid_26[2, :], ribbon=(q90_iid_26[2, :] - q90_iid_26[1, :], q90_iid_26[3, :] - q90_iid_26[2, :]), fillalpha=0.2, linewidth=3, color=:royalblue, label="IID/SSP1-2.6")
plot!(p_sim, sim_years, q90_ar_26[2, :], ribbon=(q90_ar_26[2, :] - q90_ar_26[1, :], q90_ar_26[3, :] - q90_ar_26[2, :]), fillalpha=0.2, linewidth=3, color=:firebrick1, label="AR/SSP1-2.6")
plot!(p_sim, sim_years, q90_iid_85[2, :], ribbon=(q90_iid_85[2, :] - q90_iid_85[1, :], q90_iid_85[3, :] - q90_iid_85[2, :]), fillalpha=0.2, linewidth=3, color=:blue3, label="IID/SSP5-8.5")
plot!(p_sim, sim_years, q90_ar_85[2, :], ribbon=(q90_ar_85[2, :] - q90_ar_85[1, :], q90_ar_85[3, :] - q90_ar_85[2, :]), fillalpha=0.2, linewidth=3, color=:firebrick, label="AR/SSP5-8.5")
plot!(p_sim, size=(1100, 550))
xlims!((2000, 2100))
ylims!((0.75, 5.5))
Figure 6: Differences in the 90% confidence intervals

These Differences Could Be Decision-Relevant

Code
p1 = histogram(model_sim_iid_26[:, end], color=:blue, xlabel="°C", ylabel="Count", label="IID", size=(600, 450), alpha=0.4, title="SSP1-2.6")
histogram!(p1, model_sim_ar_26[:, end], color=:red, label="AR", alpha=0.4)
p2 = histogram(model_sim_iid_85[:, end], color=:blue, xlabel="°C", ylabel="Count", label="IID", size=(600, 450), alpha=0.4, title="SSP5-8.5")
histogram!(p2, model_sim_ar_85[:, end], color=:red, label="AR", alpha=0.4)

display(p1)
display(p2)
(a) SSP1-2.6
(b) SSP5-8.5
Figure 7: Projections of global mean temperature in 2100

More General Models

Lynx and Hare Pelts

Code
lh_obs = DataFrame(CSV.File("data/ecology/Lynx_Hare.txt", header=[:Year, :Hare, :Lynx]))[:, 1:3]
plot(lh_obs[!, :Year], lh_obs[!, :Lynx], xlabel="Year", ylabel="Pelts (thousands)", markersize=5, markershape=:circle, markercolor=:red, color=:red, linewidth=3, label="Lynx")
plot!(lh_obs[!, :Year], lh_obs[!, :Hare], markersize=5, markershape=:circle, markercolor=:blue, color=:blue, linewidth=3, label="Hare")
plot!(size=(1100, 500))
Figure 8: Lynx and Hare pelt dataset

Predator-Prey Dynamics

\[ \begin{align*} \frac{dH}{dt} &= H_t \underbrace{b_H}_{\substack{\text{birth} \\ \text{rate}}} - H_t (\underbrace{L_t m_H}_{\substack{\text{impact of} \\ \text{lynxes}}}) \\ \frac{dL}{dt} &= L_t (H_t b_L) - L_t m_L \end{align*} \]

Can The Model Replicate The Patterns?

Code
# specifying the ODE problem using DifferentialEquations.jl
function lynx_hare!(dP, P, θ, t)
    H, L = P
    bh, mh, bl, ml = θ
    dP[1] = (bh - L * mh) * H
    dP[2] = (bl * H - ml) * L
end

# run a simulation based on the lynx_hare! solution
function lh_sim(params, N)
    H₁, L₁ = params[end-1:end]
    prob = ODEProblem(lynx_hare!, [H₁, L₁], (0.0, float(N - 1)), params[1:end-2])
    sol = solve(prob, saveat=1)
    H = map(first, sol.u[1:N])
    L = map(last, sol.u[1:N])
    return (H, L)
end

params = (0.54, 0.005, 0.005, 0.8, 190, 30)
H, L = lh_sim(params, nrow(lh_obs))
plot(lh_obs[!, :Year], H, label="Hare", linewidth=3, xlabel="Year", ylabel="Population (thousands)", title="Simulated Population Dynamics", color=:blue)
plot!(lh_obs[!, :Year], L, label="Lynx", linewidth=3, color=:red)
plot!(size=(1100, 450))
Figure 9: Synthetic data from predator-prey model

What Is A Generative Process For Pelts?

  • Initial population changes according to predator-prey model;
  • Some fraction of population are trapped;
  • Trap rates differ by species and can vary by year.

Predator-Prey Probability Model

\[\underbrace{h_t}_{\substack{\text{hare} \\ \text{pelts}}} \sim \text{LogNormal}(\log(\underbrace{p_H}_{\substack{\text{trap} \\ \text{rate}}} H_T), \sigma_H)\] \[l_t \sim \text{LogNormal}(\log(p_L L_T), \sigma_L)\]

\[ \begin{align*} \frac{dH}{dt} &= H_t b_H - H_t (L_t m_H) \\ H_T &= H_1 + \int_1^T \frac{dH}{dt}dt \end{align*} \]

\[ \begin{align*} \frac{dL}{dt} &= L_t (H_t b_L) - L_t m_L \\ L_T &= L_1 + \int_1^T \frac{dL}{dt}dt \end{align*} \]
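A sketch of this generative process in code, reusing lh_sim and params from above; the trap rates and noise scales below are hypothetical, not fitted values:

using Distributions

H, L = lh_sim(params, nrow(lh_obs))
p_H, p_L = 0.1, 0.2          # hypothetical trap rates
σ_H, σ_L = 0.25, 0.25        # hypothetical observation noise (log scale)
hare_pelts = rand.(LogNormal.(log.(p_H .* H), σ_H))   # simulated hare pelt counts
lynx_pelts = rand.(LogNormal.(log.(p_L .* L), σ_L))   # simulated lynx pelt counts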

Key Points

Key Points

  • Think generatively about probability models for calibrating models:
      • Discrepancy: corrects for mismatches between the model output and the “state” of the system;
      • Observation errors: probability distribution for the observations given the discrepancy adjustment.
  • The choice of probability model (including the discrepancy) can impact projections even if the hindcasts (“validation”) do not appear very different.

Discussion of Shmueli (2010)

Questions To Seed Discussion

  • What do you think are the differences between predictive and explanatory modeling?
  • What can go wrong when we conflate the two?
  • Can you think of approaches or workflows which bridge the two paradigms?

Upcoming Schedule

Next Classes

Monday: Feb Break!

Wednesday: Bayesian Statistics

Assessments

Homework 2 available; due next Friday (2/21).

No quiz or reading this week!

References

References (Scroll for Full List)

Brynjarsdóttir, J., & O’Hagan, A. (2014). Learning about physical parameters: the importance of model discrepancy. Inverse Problems, 30, 114007. https://doi.org/10.1088/0266-5611/30/11/114007
Palmer, M. D., Harris, G. R., & Gregory, J. M. (2018). Extending CMIP5 projections of global mean temperature change and sea level rise due to thermal expansion using a physically-based emulator. Environ. Res. Lett., 13, 084003. https://doi.org/10.1088/1748-9326/aad2e4