Multiple Imputation and Class Wrap-Up

Lecture 22

April 21, 2025

Review

Complete-Case Analysis

Common approach to missing data:

Complete-case Analysis: Only consider data for which all variables are available.

Can result in bias if missing values have a systematic pattern.
Could result in discarding a large amount of data.

Missingness Complete At Random (MCAR)

MCAR: \(M_Y\) is independent of \(X=x\) and \(Y=y\).

Complete cases are fully representative of the complete data:

\[\mathbb{P}(Y=y) = P(Y=y | M_Y=0)\]

Code

n = 50
x = rand(Uniform(0, 100), n)
logit(x) = log(x / (1 - x))
invlogit(x) = exp(x) / (1 + exp(x))
f(x) = invlogit(0.05 * (x - 50) + rand(Normal(0, 1)))
y = f.(x)

flin(x) = 0.25 * x + 2 + rand(Normal(0, 7))
y = flin.(x)
xpred = collect(0:0.1:100)

lm_all = lm([ones(length(x)) x], y)
y_lm_all = predict(lm_all, [ones(length(xpred)) xpred])

missing_y = Bool.(rand(Binomial.(1, 0.25), length(y)))
xobs = x[.!(missing_y)]
yobs = y[.!(missing_y)]
lm_mcar = lm([ones(n - sum(missing_y)) xobs], yobs)
y_lm_mcar = predict(lm_mcar, [ones(length(xpred)) xpred])


p1 = scatter(xobs, yobs, xlabel=L"$x$", ylabel=L"$y$", label=false, markersize=5, size=(600, 500), color=:blue)
scatter!(x[missing_y], y[missing_y], alpha=0.9, color=:lightgrey, label=false, markersize=5)
plot!(xpred, y_lm_all, color=:red, lw=3, label="Complete-Data Inference", ribbon=GLM.dispersion(lm_all), fillalpha=0.2)
plot!(xpred, y_lm_mcar, color=:blue, lw=3, linestyle=:dot, label="Observed-Data Inference", ribbon=GLM.dispersion(lm_mcar), fillalpha=0.2)

Figure 1: Illustration of MCAR Data

Missingness At Random (MAR)

MAR: \(M_Y\) is independent of \(Y=y\) conditional on \(X=x\).

Also called ignorable or uninformative missingness.

\[ \begin{align*} \mathbb{P}&(Y=y | X=x) \\ &= \mathbb{P}(Y=y | X=x, M_Y=0) \end{align*} \]

Code

missing_y = Bool.(rand.(Binomial.(1,  invlogit.(0.1 * (x .- 75)))))
xobs = x[.!(missing_y)]
yobs = y[.!(missing_y)]
lm_mar = lm([ones(n - sum(missing_y)) xobs], yobs)
y_lm_mar = predict(lm_mar, [ones(length(xpred)) xpred])

p2 = scatter(xobs, yobs, xlabel=L"$x$", ylabel=L"$y$", label=false, markersize=5, size=(600, 500), color=:blue)
scatter!(x[missing_y], y[missing_y], alpha=0.9, color=:lightgrey, label=false, markersize=5)
plot!(xpred, y_lm_all, color=:red, lw=3, label="Complete-Data Inference", ribbon=GLM.dispersion(lm_all), fillalpha=0.2)
plot!(xpred, y_lm_mar, color=:blue, lw=3, linestyle=:dot, label="Observed-Data Inference", ribbon=GLM.dispersion(lm_mar), fillalpha=0.2)

Figure 2: Illustration of MAR Data

Missingness Not-At-Random (MNAR)

MNAR: \(M_Y\) is dependent on \(Y=y\) (and/or unmodeled variables).

Also called non-ignorable or informative missingness.

\[ \begin{align*} \mathbb{P}&(Y=y | X=x) \\ &\neq \mathbb{P}(Y=y | X=x, M_Y=0) \end{align*} \]

Code

missing_y = Bool.(rand.(Binomial.(1,  invlogit.(0.9 * (y .- 15)))))
xobs = x[.!(missing_y)]
yobs = y[.!(missing_y)]
lm_mnar = lm([ones(n - sum(missing_y)) xobs], yobs)
y_lm_mnar = predict(lm_mnar, [ones(length(xpred)) xpred])

p2 = scatter(xobs, yobs, xlabel=L"$x$", ylabel=L"$y$", label=false, markersize=5, size=(600, 500), color=:blue)
scatter!(x[missing_y], y[missing_y], alpha=0.9, color=:lightgrey, label=false, markersize=5)
plot!(xpred, y_lm_all, color=:red, lw=3, label="Complete-Data Inference", ribbon=GLM.dispersion(lm_all), fillalpha=0.2)
plot!(xpred, y_lm_mnar, color=:blue, lw=3, linestyle=:dot, label="Observed-Data Inference", ribbon=GLM.dispersion(lm_mnar), fillalpha=0.2)

Figure 3: Illustration of MCAR Data

Implications of Missingness Mechanism

MCAR: Strong, but generally implausible. Can only use complete cases as observed data is fully representative.
MAR: More plausible than MCAR, can still justify complete-case analysis as conditional observed distributions are unbiased estimates of conditional complete distributions.
MNAR: Deletion is a bad idea. The observed data does not follow the same conditional distribution. Missingness can be informative: try to model the missingness mechanism.

Methods for Dealing with Missing Data

Imputation: substitute values for missing data before analysis;
Averaging: find expected values over all possible values of the missing variables.

Multiple Imputation Example

Example Quality Data

Code

dat = CSV.read("data/airquality/airquality.csv", DataFrame)
rename!(dat, :"Solar.R" => :Solar)
dat.Miss_Ozone = ismissing.(dat.Ozone)
dat.Miss_Solar = ismissing.(dat.Solar)
dat[2:5, 1:7]

4×7 DataFrame

Row	rownames	Ozone	Solar	Wind	Temp	Month	Day
	Int64	Int64?	Int64?	Float64	Int64	Int64	Int64
1	2	36	118	8.0	72	5	2
2	3	12	149	12.6	74	5	3
3	4	18	313	11.5	62	5	4
4	5	missing	missing	14.3	56	5	5

Figure 4: Air quality dataset.

Assessing Missingness (Ozone)

Code

dat_ozone_complete = filter(:Ozone => x -> !ismissing(x), dat)
dat_ozone_missing = filter(:Ozone => x -> ismissing(x), dat)

p1 = scatter(dat_ozone_complete.Temp, dat_ozone_complete.Wind, xlabel = "Temperature (°C)", ylabel="Wind (mph)", markersize=5, color=:blue, label="Not Missing")
scatter!(dat_ozone_missing.Temp, dat_ozone_missing.Wind, markersize=5, color=:orange, label="Missing")
plot!(size=(600, 450))

p2 = scatter(dat_ozone_complete.Month, dat_ozone_complete.Day, xlabel = "Month", ylabel="Day", markersize=5, color=:blue, label="Not Missing")
scatter!(dat_ozone_missing.Month, dat_ozone_missing.Day, markersize=5, color=:orange, label="Missing")
plot!(size=(600, 450))

display(p1)
display(p2)

(a) Air quality dataset.

(b)

Figure 5

Assessing Missingness (Solar)

Code

dat_solar_complete = filter(:Solar => x -> !ismissing(x), dat)
dat_solar_missing = filter(:Solar => x -> ismissing(x), dat)

p1 = scatter(dat_solar_complete.Temp, dat_solar_complete.Wind, xlabel = "Temperature (°C)", ylabel="Wind (mph)", markersize=5, color=:blue, label="Not Missing")
scatter!(dat_solar_missing.Temp, dat_solar_missing.Wind, markersize=5, color=:orange, label="Missing")
plot!(size=(600, 450))

p2 = scatter(dat_solar_complete.Month, dat_solar_complete.Day, xlabel = "Month", ylabel="Day", markersize=5, color=:blue, label="Not Missing")
scatter!(dat_solar_missing.Month, dat_solar_missing.Day, markersize=5, color=:orange, label="Missing")
plot!(size=(600, 450))

display(p1)
display(p2)

(a) Air quality dataset.

(b)

Figure 6

Prediction with Noise

Obtain bootstrap replicate of each imputation regression model.
Impute by simulating from predictive distribution (including noise!).
Fit target regression model to imputed dataset.
Repeat for number of imputations.

Airquality Imputation (Prediction)

Impute using available data:

\[ \begin{align*} \text{Ozone} &\sim f(\text{Wind}, \text{Temp}, \text{Month}, \text{Day}) \\ \text{Solar.R} &\sim g(\text{Wind}, \text{Temp}, \text{Month}, \text{Day}) \\ \end{align*} \]

Candidate Model-Based Imputations

Code

# bootstrap linear models and get predictions
nboot = 100
## model for solar radiation
function impute_bootstrap_model(dat_complete, dat_missing, model_formula)
    idx = sample(1:nrow(dat_complete), nrow(dat_complete), replace=true)
    dat_boot = dat_complete[idx, :]
    mod = lm(model_formula, dat_boot)
    return mod
end

function impute_predict_regression(dat_complete, dat_missing, nboot, model_formula)
    impute_out = zeros(nrow(dat_missing), nboot)
    for i = 1:nboot
        mod = impute_bootstrap_model(dat_complete, dat_missing, model_formula)
        impute_out[:, i] = predict(mod, dat_missing) .+ rand(Normal(0, GLM.dispersion(mod.model)), size(impute_out)[1])
    end
    return impute_out
end

impute_solar = impute_predict_regression(dat_solar_complete, dat_solar_missing, nboot, @formula(Solar ~ Wind + Temp + Month + Day))

## model for ozone
impute_ozone = impute_predict_regression(dat_ozone_complete, dat_ozone_missing, nboot, @formula(Ozone ~ Wind + Temp + Month + Day))

# impute values into the complete-case dataset and plot
function impute_variables(dat, impute_ozone, impute_solar)
    impute = deepcopy(dat)
    impute[ismissing.(impute.Ozone), :Ozone] = round.(impute_ozone[:, 1]; digits=0)
    impute[ismissing.(impute.Solar), :Solar] = round.(impute_solar[:, 1]; digits=0)
    return impute
end
impute1 = impute_variables(dat, impute_ozone[:, 1], impute_solar[:, 1])
p1 = scatter(impute1.Solar[.!(impute1.Miss_Ozone) .& .!(impute1.Miss_Solar)], impute1.Ozone[.!(impute1.Miss_Ozone) .& .!(impute1.Miss_Solar)], color=:blue, markersize=5, xlabel=L"Solar Radiation (W/m$^2$)", ylabel="Ozone (ppb)", label="Observed")
scatter!(impute1.Solar[impute1.Miss_Ozone .| impute1.Miss_Solar], impute1.Ozone[impute1.Miss_Ozone .| impute1.Miss_Solar], color=:orange, markersize=5, label="Imputed")
plot!(size=(600, 450))

impute2 = impute_variables(dat, impute_ozone[:, 2], impute_solar[:, 2])
p2 = scatter(impute2.Solar[.!(impute2.Miss_Ozone) .& .!(impute2.Miss_Solar)], impute2.Ozone[.!(impute2.Miss_Ozone) .& .!(impute2.Miss_Solar)], color=:blue, markersize=5, xlabel=L"Solar Radiation (W/m$^2$)", ylabel="Ozone (ppb)", label="Observed")
scatter!(impute2.Solar[impute2.Miss_Ozone .| impute2.Miss_Solar], impute2.Ozone[impute2.Miss_Ozone .| impute2.Miss_Solar], color=:orange, markersize=5, label="Imputed")
plot!(size=(600, 450))

display(p1)
display(p2)

(a) Air quality dataset.

(b)

Figure 7

Model Imputed Time Series (Ozone)

Code

p1 = plot(dat.rownames, dat.Ozone, lw=3, color=:blue, label="Observations", xlabel="Day Number", ylabel="Ozone (ppb)")
for i = 1:nrow(dat_ozone_missing)
    label = i == 1 ? "Imputations" : false
    boxplot!(p1, [dat_ozone_missing[i, :rownames]], impute_ozone[i, :], color=:orange, label=label)
end
plot!(size=(1200, 500))

display(p1)

Figure 8: Air quality dataset.

Model Imputed Time Series (Solar)

Code

p2 = plot(dat.rownames, dat.Solar, lw=3, color=:blue, label="Observations", xlabel="Day Number", ylabel=L"Solar (W/m$^2$)")
for i = 1:nrow(dat_solar_missing)
    label = i == 1 ? "Imputations" : false
    boxplot!(p2, [dat_solar_missing[i, :rownames]], impute_solar[i, :], color=:orange, label=label)
end
plot!(size=(1200, 500))

display(p2)

Figure 9: Air quality dataset.

Model-Imputed Regression Coefficients

Code

β_boot = zeros(nboot)
σ_boot = zeros(nboot)
for i = 1:nboot
    dat_boot = impute_variables(dat, impute_ozone[:, i], impute_solar[:, i])
    model_boot = lm(@formula(Ozone ~ Solar), dat_boot)
    β_boot[i] = coef(model_boot)[2]
    σ_boot[i] = dispersion(model_boot.model)
end
β_est = mean(β_boot)
σ_est = sqrt(mean(σ_boot.^2) + (1 + 1/nboot) * var(β_boot))

# also get complete case estimates for later
cc_model = lm(@formula(Ozone ~ Solar), dropmissing(dat))
β_cc = coef(cc_model)[2]
σ_cc = dispersion(cc_model.model)
histogram(β_boot, xlabel=L"$\beta$", ylabel="Count", label=false)
vline!([β_cc], color=:red, label="Complete-Case")
plot!(size=(500, 450))

Figure 10: Air quality dataset.

Imputed:

\(\hat{\beta} = 0\.1\)
\(\hat{\sigma} = 31\.0\)

Complete Case:

\(\hat{\beta} = 0\.13\)
\(\hat{\sigma} = 31\.0\)

Predictive Mean Matching

Obtain bootstrap replicate of each imputation regression model.
Get predicted value of missing value \(\hat{y}_j\).
Generate candidates by finding \(k\) nearest complete cases (minimize \(|y - \hat{y}_j|\)) or use threshold \(\eta\).
Sample from candidates to get imputed value.
Fit target regression model to imputed dataset.
Repeat for number of imputations.

Predictive Mean Matching Imputations

Code

# bootstrap linear models and get predictions
nboot = 100 # number of bootstrap samples for parameter variabiity
k = 5 # number of nearest-neighbors to sample from

## model for solar radiation
function impute_pmm(dat_complete, dat_missing, nboot, nneighbors, model_formula, target_name)
    impute = zeros(nrow(dat_missing), nboot)
    candidates = zeros(nrow(dat_missing), nboot, nneighbors)
    for i = 1:nboot
        mod = impute_bootstrap_model(dat_complete, dat_missing, model_formula)
        ŷ = predict(mod, dat_missing) # get predicted value for missing data
        y = predict(mod, dat_complete)
        for j = 1:nrow(dat_missing)
            d = abs.(y .- ŷ[j])
            sort_idx = sortperm(d)
            candidates[j, i, 1:nneighbors] = dat_complete[sort_idx[1:nneighbors], target_name]
            impute[j, i] = sample(candidates[j, i, :])
        end
    end
    return impute
end

impute_ozone_pmm = impute_pmm(dat_ozone_complete, dat_ozone_missing, 100, 5, @formula(Ozone ~ Wind + Temp + Month + Day), Symbol("Ozone"))
impute_solar_pmm = impute_pmm(dat_solar_complete, dat_solar_missing, 100, 5, @formula(Solar ~ Wind + Temp + Month + Day), Symbol("Solar"))


function impute_variables(dat, impute_ozone, impute_solar)
    impute = deepcopy(dat)
    impute[ismissing.(impute.Ozone), :Ozone] = round.(impute_ozone[:, 1]; digits=0)
    impute[ismissing.(impute.Solar), :Solar] = round.(impute_solar[:, 1]; digits=0)
    return impute
end
impute1 = impute_variables(dat, impute_ozone_pmm[:, 1], impute_solar_pmm[:, 1])
p1 = scatter(impute1.Solar[.!(impute1.Miss_Ozone) .& .!(impute1.Miss_Solar)], impute1.Ozone[.!(impute1.Miss_Ozone) .& .!(impute1.Miss_Solar)], color=:blue, markersize=5, xlabel=L"Solar Radiation (W/m$^2$)", ylabel="Ozone (ppb)", label="Observed")
scatter!(impute1.Solar[impute1.Miss_Ozone .| impute1.Miss_Solar], impute1.Ozone[impute1.Miss_Ozone .| impute1.Miss_Solar], color=:orange, markersize=5, label="Imputed")
plot!(size=(600, 450))

impute2 = impute_variables(dat, impute_ozone_pmm[:, 2], impute_solar_pmm[:, 2])
p2 = scatter(impute2.Solar[.!(impute2.Miss_Ozone) .& .!(impute2.Miss_Solar)], impute2.Ozone[.!(impute2.Miss_Ozone) .& .!(impute2.Miss_Solar)], color=:blue, markersize=5, xlabel=L"Solar Radiation (W/m$^2$)", ylabel="Ozone (ppb)", label="Observed")
scatter!(impute2.Solar[impute2.Miss_Ozone .| impute2.Miss_Solar], impute2.Ozone[impute2.Miss_Ozone .| impute2.Miss_Solar], color=:orange, markersize=5, label="Imputed")
plot!(size=(600, 450))

display(p1)
display(p2)

(a) Air quality dataset.

(b)

Figure 11

PMM Imputed Time Series (Ozone)

Code

p1mm = plot(dat.rownames, dat.Ozone, lw=3, color=:blue, label="Observations", xlabel="Day Number", ylabel="Ozone (ppb)")
for i = 1:nrow(dat_ozone_missing)
    label = i == 1 ? "Imputations" : false
    boxplot!(p1mm, [dat_ozone_missing[i, :rownames]], impute_ozone_pmm[i, :], color=:orange, label=label)
end
plot!(size=(1200, 500))

display(p1mm)

Figure 12: Air quality dataset.

PMM Imputed Time Series (Solar)

Code

p2mm = plot(dat.rownames, dat.Solar, lw=3, color=:blue, label="Observations", xlabel="Day Number", ylabel=L"Solar (W/m$^2$)")
for i = 1:nrow(dat_solar_missing)
    label = i == 1 ? "Imputations" : false
    boxplot!(p2mm, [dat_solar_missing[i, :rownames]], impute_solar_pmm[i, :], color=:orange, label=label)
end
plot!(size=(1200, 500))

display(p2mm)

Figure 13: Air quality dataset.

PMM-Imputed Regression Coefficients

Code

β_boot = zeros(nboot)
σ_boot = zeros(nboot)
for i = 1:nboot
    dat_boot = impute_variables(dat, impute_ozone[:, i], impute_solar[:, i])
    model_boot = lm(@formula(Ozone ~ Solar), dat_boot)
    β_boot[i] = coef(model_boot)[2]
    σ_boot[i] = dispersion(model_boot.model)
end
β_est = mean(β_boot)
σ_est = sqrt(mean(σ_boot.^2) + (1 + 1/nboot) * var(β_boot))

# also get complete case estimates for later
cc_model = lm(@formula(Ozone ~ Solar), dropmissing(dat))
β_cc = coef(cc_model)[2]
σ_cc = dispersion(cc_model.model)
histogram(β_boot, xlabel=L"$\beta$", ylabel="Count", label=false)
vline!([β_cc], color=:red, label="Complete-Case")
plot!(size=(500, 450))

Figure 14: Air quality dataset.

Imputed:

\(\hat{\beta} = 0\.1\)
\(\hat{\sigma} = 31\.0\)

Complete Case:

\(\hat{\beta} = 0\.13\)
\(\hat{\sigma} = 31\.0\)

Comparison of Imputations (Solar)

Code

impute_model_df = DataFrame(impute_solar', :auto)
impute_model_df_stk = stack(impute_model_df)
impute_model_df_stk.Method .= "Prediction"

impute_pmm_df = DataFrame(impute_solar_pmm', :auto)
impute_pmm_df_stk = stack(impute_pmm_df)
impute_pmm_df_stk.Method .= "PMM"

impute_all_df = vcat(impute_model_df_stk, impute_pmm_df_stk)

@df impute_all_df groupedboxplot(:variable, :value, group=:Method, xlabel="Imputed Case", ylabel=L"Solar Radiation (W/m$^2$)")
plot!(size=(1300, 500))

Figure 15: Air quality dataset.

Comparison of Imputations (Ozone)

Code

impute_model_df = DataFrame(impute_ozone', :auto)
impute_model_df_stk = stack(impute_model_df)
impute_model_df_stk.Method .= "Prediction"

impute_pmm_df = DataFrame(impute_ozone_pmm', :auto)
impute_pmm_df_stk = stack(impute_pmm_df)
impute_pmm_df_stk.Method .= "PMM"

impute_all_df = vcat(impute_model_df_stk, impute_pmm_df_stk)

@df impute_all_df groupedboxplot(:variable, :value, group=:Method, xlabel="Imputed Case", ylabel="Ozone (ppb)")
plot!(size=(1300, 500))

Figure 16: Air quality dataset.

Key Points

Use as much information as possible when conducting multiple imputation.
Incorporate as much uncertainty as possible to avoid biasing downstream results: we don’t know what the missing data looks like!

Class Review

Inference and Description Are Linked

Knowing what is important to describe requires a model of data generation;
Doing meaningful inference requires a model of data generation.

Themes of This Class

Probability theory helps us deduce logical implications of theories conditional on our assumptions
Cannot use an “objective” procedure to avoid subjective responsibility
Vaguely motivated procedures give vague or misleading results

Data Generation Approximates Reality

Source: Richard McElreath

Class Review

timeline
      Introduction: Overview
                  : Hypothesis Testing and Scientific Inference
      Probability Fundamentals: Prob/Stats "Review"
                              : Modeling Data-Generating Processes
                              : Bayesian Statistics
                              : Model-Data Discrepancy
                              : Autocorrelated Residuals
      Simulation Methods: Monte Carlo
                        : Bootstrap
                        : MCMC
      Model Evaluation: Cross-Validation
                      : Model Selection
      Useful Extras: Extreme Values
                    : Missing Data

Workflow Covered In Class

Exploratory Analysis
Develop candidate model(s) and calibrate.
Simulate from models to assess implications.
Compare evidence for models with scoring rules/information criteria.

What Might Come Next

More advanced probability models (e.g. mixtures)
Spatial models
Experimental design (confounds)
(Probabilistic) Machine learning

Upcoming Schedule

Wednesday: No Class

Next Week + 5/5: Project Presentations

Assessments

HW5: due 5/2.

Literature Critique: Due 5/2.

Multiple Imputation and Class Wrap-Up

Review

Complete-Case Analysis

Missingness Complete At Random (MCAR)

Missingness At Random (MAR)

Missingness Not-At-Random (MNAR)

Implications of Missingness Mechanism

Methods for Dealing with Missing Data

Multiple Imputation Example

Example Quality Data

Assessing Missingness (Ozone)

Assessing Missingness (Solar)

Prediction with Noise

Airquality Imputation (Prediction)

Candidate Model-Based Imputations

Model Imputed Time Series (Ozone)

Model Imputed Time Series (Solar)

Model-Imputed Regression Coefficients

Predictive Mean Matching

Predictive Mean Matching Imputations

PMM Imputed Time Series (Ozone)

PMM Imputed Time Series (Solar)

PMM-Imputed Regression Coefficients

Comparison of Imputations (Solar)

Comparison of Imputations (Ozone)

Key Points

Key Points

Class Review

Inference and Description Are Linked

Themes of This Class

Data Generation Approximates Reality

Class Review

Workflow Covered In Class

What Might Come Next

Upcoming Schedule

Assessments

References