Lecture 16
March 19, 2024

Source: Richard McElreath
# Fit a degree-d polynomial to (x, y) by maximizing a Normal log-likelihood.
# Requires: using Distributions, Optim
function polyfit(d, x, y)
    # polynomial model: m(x) = Σᵢ θᵢ₊₁ xⁱ for i = 0, …, d
    function m(d, θ, x)
        mout = zeros(length(x), d + 1)
        for j in eachindex(x)
            for i = 0:d
                mout[j, i + 1] = θ[i + 1] * x[j]^i
            end
        end
        return sum(mout; dims=2)
    end
    θ₀ = [zeros(d + 1); 1.0]                 # initial coefficients and noise σ
    lb = [-10.0 .+ zeros(d + 1); 0.01]       # lower bounds (σ > 0)
    ub = [10.0 .+ zeros(d + 1); 20.0]        # upper bounds
    # minimize the negative log-likelihood subject to the box constraints
    optim_out = optimize(θ -> -sum(logpdf.(Normal.(m(d, θ[1:end-1], x), θ[end]), y)), lb, ub, θ₀)
    θmin = optim_out.minimizer
    mfit(x) = sum([θmin[i + 1] * x^i for i in 0:d])
    return (mfit, θmin[end])                 # fitted function and noise σ
end
# Plot the data and the degree-d fit with a 95% interval from the fitted σ.
# Requires: using Plots, LaTeXStrings
function plot_polyfit(d, x, y)
    m, σ = polyfit(d, x, y)
    p = scatter(x, y, label="Data", markersize=5, ylabel=L"$y$", xlabel=L"$x$", title="Degree $d")
    # xrange is a plotting grid defined earlier in the lecture
    plot!(p, xrange, m.(xrange), ribbon=1.96 * σ, fillalpha=0.2, lw=3, label="Fit")
    ylims!(p, (-30, 15))
    plot!(p, size=(600, 450))
    return p
end
p1 = plot_polyfit(1, x, y)
p2 = plot_polyfit(2, x, y)
display(p1)
display(p2)

We can think of a model as a form of data compression.
Instead of storing the coordinates of individual points, we project the data onto the parameters of a functional form.
The degree to which we can “tune” the model by adjusting its parameters is called the model degrees of freedom (DOF), which is one measure of model complexity.
Higher DOF ⇒ more ability to represent complex patterns.
If DOF is too low, the model can’t capture meaningful data-generating signals (underfitting).
But if DOF is too high, the model will “learn” the noise rather than the signal, resulting in poor generalization (overfitting).
# Held-out test set from the same data-generating process
# (f is the true function defined earlier in the lecture)
ntest = 20
xtest = rand(Uniform(-2, 2), ntest)
ytest = f(xtest) + rand(Normal(0, 2), length(xtest))
in_error = zeros(11)   # in-sample MSE for degrees 0–10
out_error = zeros(11)  # out-of-sample MSE
for d = 0:10
    m, σ = polyfit(d, x, y)
    in_error[d+1] = mean((m.(x) .- y).^2)
    out_error[d+1] = mean((m.(xtest) .- ytest).^2)
end
plot(0:10, in_error, markersize=5, color=:blue, lw=3, label="In-Sample Error", xlabel="Polynomial Degree", ylabel="Mean Squared Error", legend=:topleft)
plot!(0:10, out_error, markersize=5, color=:red, lw=3, label="Out-of-Sample Error")
plot!(yaxis=:log)

Example from The Signal and the Noise by Nate Silver:
The difference between low and high DOFs can be formalized using bias and variance.
Suppose we have a data-generating model \[y = f(x) + \varepsilon, \varepsilon \sim N(0, \sigma).\] We want to fit a model \(\hat{y} \approx \hat{f}(x)\).
Bias is systematic error from mismatch between the average model prediction and the true function (\(\text{Bias}[\hat{f}] = \mathbb{E}[\hat{f}(x)] - f(x)\)).
Bias comes from under-fitting meaningful relationships between inputs and outputs:
Variance is error from over-sensitivity to small fluctuations in training inputs \(D\) (\(\text{Variance} = \text{Var}_D(\hat{f}(x; D))\)).
Variance can come from over-fitting noise in the data:
Can decompose MSE into bias and variance terms:
\[ \begin{align*} \text{MSE} &= \mathbb{E}[(y - \hat{f})^2] \\ &= \mathbb{E}[y^2 - 2y\hat{f} + \hat{f}^2] \\ &= \mathbb{E}[y^2] - 2\mathbb{E}[y\hat{f}] + \mathbb{E}[\hat{f}^2] \\ &= \mathbb{E}[(f + \varepsilon)^2] - 2\mathbb{E}[(f + \varepsilon)\hat{f}] + \mathbb{E}[\hat{f}^2] \\ &= \vdots \\ &= \text{Bias}(\hat{f})^2 + \text{Var}(\hat{f}) + \sigma^2 \end{align*} \]
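The elided steps can be completed using the independence of \(\varepsilon\) and \(\hat{f}\), \(\mathbb{E}[\varepsilon] = 0\), and the fact that \(f(x)\) is deterministic at a fixed \(x\):

\[ \begin{align*} \mathbb{E}[(y - \hat{f})^2] &= \mathbb{E}[(f + \varepsilon - \hat{f})^2] \\ &= \mathbb{E}[(f - \hat{f})^2] + 2\,\mathbb{E}[(f - \hat{f})\varepsilon] + \mathbb{E}[\varepsilon^2] \\ &= \mathbb{E}[(f - \hat{f})^2] + \sigma^2 & (\varepsilon \perp \hat{f},\ \mathbb{E}[\varepsilon] = 0) \\ &= (f - \mathbb{E}[\hat{f}])^2 + \mathbb{E}[(\hat{f} - \mathbb{E}[\hat{f}])^2] + \sigma^2 \\ &= \text{Bias}(\hat{f})^2 + \text{Var}(\hat{f}) + \sigma^2 \end{align*} \]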
This means that for a fixed error level, you can reduce bias (by increasing model complexity) or reduce variance (by simplifying the model), but one comes at the cost of the other.
This is the so-called “bias-variance tradeoff.”
This decomposition is for MSE, but the principle holds more generally.
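A minimal Monte Carlo sketch of the decomposition (in Python/NumPy rather than the lecture's Julia; the quadratic \(f\), noise level, and test point are hypothetical stand-ins): repeatedly regenerate noisy training data, refit an underparameterized model, and check that the average squared error at a test point matches \(\text{Bias}^2 + \text{Var} + \sigma^2\).

```python
import numpy as np

rng = np.random.default_rng(42)
f = lambda x: x**2 - 2 * x            # "true" function (illustrative choice)
sigma = 1.0                            # noise standard deviation
x_train = np.linspace(-2, 2, 20)       # fixed design points
x0 = 0.5                               # test point at which we decompose the error

preds, errs = [], []
for _ in range(5000):
    # new noisy training set on each replicate
    y_train = f(x_train) + rng.normal(0, sigma, x_train.size)
    coefs = np.polyfit(x_train, y_train, deg=1)   # degree-1 model: underfit
    fhat = np.polyval(coefs, x0)
    preds.append(fhat)
    # fresh noisy observation at the test point
    errs.append((f(x0) + rng.normal(0, sigma) - fhat) ** 2)

preds = np.array(preds)
bias2 = (preds.mean() - f(x0)) ** 2    # squared bias of the predictions
var = preds.var()                      # variance of the predictions across datasets
mse = np.mean(errs)
print(f"MSE = {mse:.3f}, bias^2 + var + sigma^2 = {bias2 + var + sigma**2:.3f}")
```

Because the degree-1 model cannot represent the quadratic signal, the bias term dominates here; refitting with a higher degree shifts error from bias into variance.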

Source: Wikipedia
\(R^2\) measures the fraction of the data variance explained by the model: \[R^2 = \sigma^2_\text{model}/\sigma^2_\text{data}\]
SMBC: Take It Off
Anscombe’s Quartet (Anscombe, 1973) consists of datasets which have the same summary statistics (including \(R^2\)) but very different graphs.
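The first two of Anscombe's datasets make the point concretely: dataset I is roughly linear with noise, while dataset II is a noiseless quadratic arc, yet their means, variances, fitted lines, and \(R^2\) agree. A quick check (in Python/NumPy rather than the lecture's Julia), using the published values:

```python
import numpy as np

# Anscombe (1973) datasets I (linear + noise) and II (quadratic arc); shared x column
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

stats = []
for y in (y1, y2):
    slope, intercept = np.polyfit(x, y, 1)      # least-squares line
    r2 = np.corrcoef(x, y)[0, 1] ** 2           # squared correlation = R^2
    stats.append((y.mean(), y.var(ddof=1), slope, intercept, r2))
    print(f"mean={y.mean():.2f} var={y.var(ddof=1):.2f} "
          f"fit=({intercept:.2f} + {slope:.3f}x) R^2={r2:.2f}")
```

Both rows print essentially identical summaries (mean ≈ 7.50, fitted line ≈ 3.00 + 0.500x, \(R^2 \approx 0.67\)); only a plot reveals how different the datasets are.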

Source: Wikipedia
In pretty much every case where \(R^2\) might be useful, (root) mean squared error ((R)MSE) is better.
More generally, we want to think about measures which capture the skill of a probabilistic prediction.
These are commonly called scoring rules (Gneiting & Katzfuss, 2014).
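As a taste of scoring rules ahead of that discussion: the logarithmic score evaluates a forecast by the log predictive density it assigns to the observed value, so it rewards both accuracy and well-calibrated uncertainty, which MSE alone cannot distinguish. A minimal sketch with hypothetical numbers (Python, standard library only):

```python
import math

def log_score(mu, sigma, y):
    """Logarithmic score: log density of Normal(mu, sigma) at observation y.
    Higher is better."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - (y - mu) ** 2 / (2 * sigma**2)

y_obs = 1.2  # hypothetical observation
# two forecasts with the same mean (hence identical squared error of 0.04)
# but very different stated uncertainty
sharp = log_score(1.0, 0.5, y_obs)   # sharp, well-calibrated forecast
vague = log_score(1.0, 5.0, y_obs)   # same mean, needlessly wide forecast
print(f"sharp: {sharp:.3f}, vague: {vague:.3f}")
```

The sharp forecast scores higher because it concentrated density near the outcome; a point-error metric would rate the two forecasts identically.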
Next Week: Model Comparison Methods: Cross-Validation and Information Criteria
Project Proposal: Due Friday.
HW4: Will release this week, due 4/11 (after break).
Literature Critique: Talk to Prof. Srikrishnan if you want help finding a paper.
No quiz this week so you can focus on your project proposal.