Multiple Imputation and Class Wrap-Up


Lecture 22

April 21, 2025

Review

Complete-Case Analysis

Common approach to missing data:

Complete-case Analysis: Only consider data for which all variables are available.

  • Can result in bias if missing values have a systematic pattern.
  • Could result in discarding a large amount of data.

Missingness Complete At Random (MCAR)

MCAR: \(M_Y\) is independent of \(X=x\) and \(Y=y\).

Complete cases are fully representative of the complete data:

\[\mathbb{P}(Y=y) = P(Y=y | M_Y=0)\]

Code
n = 50
x = rand(Uniform(0, 100), n)
logit(x) = log(x / (1 - x))
invlogit(x) = exp(x) / (1 + exp(x))
f(x) = invlogit(0.05 * (x - 50) + rand(Normal(0, 1)))
y = f.(x)

flin(x) = 0.25 * x + 2 + rand(Normal(0, 7))
y = flin.(x)
xpred = collect(0:0.1:100)

lm_all = lm([ones(length(x)) x], y)
y_lm_all = predict(lm_all, [ones(length(xpred)) xpred])

missing_y = Bool.(rand(Binomial.(1, 0.25), length(y)))
xobs = x[.!(missing_y)]
yobs = y[.!(missing_y)]
lm_mcar = lm([ones(n - sum(missing_y)) xobs], yobs)
y_lm_mcar = predict(lm_mcar, [ones(length(xpred)) xpred])


p1 = scatter(xobs, yobs, xlabel=L"$x$", ylabel=L"$y$", label=false, markersize=5, size=(600, 500), color=:blue)
scatter!(x[missing_y], y[missing_y], alpha=0.9, color=:lightgrey, label=false, markersize=5)
plot!(xpred, y_lm_all, color=:red, lw=3, label="Complete-Data Inference", ribbon=GLM.dispersion(lm_all), fillalpha=0.2)
plot!(xpred, y_lm_mcar, color=:blue, lw=3, linestyle=:dot, label="Observed-Data Inference", ribbon=GLM.dispersion(lm_mcar), fillalpha=0.2)
Figure 1: Illustration of MCAR Data

Missingness At Random (MAR)

MAR: \(M_Y\) is independent of \(Y=y\) conditional on \(X=x\).

Also called ignorable or uninformative missingness.

\[ \begin{align*} \mathbb{P}&(Y=y | X=x) \\ &= \mathbb{P}(Y=y | X=x, M_Y=0) \end{align*} \]

Code
missing_y = Bool.(rand.(Binomial.(1,  invlogit.(0.1 * (x .- 75)))))
xobs = x[.!(missing_y)]
yobs = y[.!(missing_y)]
lm_mar = lm([ones(n - sum(missing_y)) xobs], yobs)
y_lm_mar = predict(lm_mar, [ones(length(xpred)) xpred])

p2 = scatter(xobs, yobs, xlabel=L"$x$", ylabel=L"$y$", label=false, markersize=5, size=(600, 500), color=:blue)
scatter!(x[missing_y], y[missing_y], alpha=0.9, color=:lightgrey, label=false, markersize=5)
plot!(xpred, y_lm_all, color=:red, lw=3, label="Complete-Data Inference", ribbon=GLM.dispersion(lm_all), fillalpha=0.2)
plot!(xpred, y_lm_mar, color=:blue, lw=3, linestyle=:dot, label="Observed-Data Inference", ribbon=GLM.dispersion(lm_mar), fillalpha=0.2)
Figure 2: Illustration of MAR Data

Missingness Not-At-Random (MNAR)

MNAR: \(M_Y\) is dependent on \(Y=y\) (and/or unmodeled variables).

Also called non-ignorable or informative missingness.

\[ \begin{align*} \mathbb{P}&(Y=y | X=x) \\ &\neq \mathbb{P}(Y=y | X=x, M_Y=0) \end{align*} \]

Code
missing_y = Bool.(rand.(Binomial.(1,  invlogit.(0.9 * (y .- 15)))))
xobs = x[.!(missing_y)]
yobs = y[.!(missing_y)]
lm_mnar = lm([ones(n - sum(missing_y)) xobs], yobs)
y_lm_mnar = predict(lm_mnar, [ones(length(xpred)) xpred])

p2 = scatter(xobs, yobs, xlabel=L"$x$", ylabel=L"$y$", label=false, markersize=5, size=(600, 500), color=:blue)
scatter!(x[missing_y], y[missing_y], alpha=0.9, color=:lightgrey, label=false, markersize=5)
plot!(xpred, y_lm_all, color=:red, lw=3, label="Complete-Data Inference", ribbon=GLM.dispersion(lm_all), fillalpha=0.2)
plot!(xpred, y_lm_mnar, color=:blue, lw=3, linestyle=:dot, label="Observed-Data Inference", ribbon=GLM.dispersion(lm_mnar), fillalpha=0.2)
Figure 3: Illustration of MCAR Data

Implications of Missingness Mechanism

  1. MCAR: Strong, but generally implausible. Can only use complete cases as observed data is fully representative.
  2. MAR: More plausible than MCAR, can still justify complete-case analysis as conditional observed distributions are unbiased estimates of conditional complete distributions.
  3. MNAR: Deletion is a bad idea. The observed data does not follow the same conditional distribution. Missingness can be informative: try to model the missingness mechanism.

Methods for Dealing with Missing Data

  1. Imputation: substitute values for missing data before analysis;
  2. Averaging: find expected values over all possible values of the missing variables.

Multiple Imputation Example

Example Quality Data

Code
dat = CSV.read("data/airquality/airquality.csv", DataFrame)
rename!(dat, :"Solar.R" => :Solar)
dat.Miss_Ozone = ismissing.(dat.Ozone)
dat.Miss_Solar = ismissing.(dat.Solar)
dat[2:5, 1:7]
4×7 DataFrame
Row rownames Ozone Solar Wind Temp Month Day
Int64 Int64? Int64? Float64 Int64 Int64 Int64
1 2 36 118 8.0 72 5 2
2 3 12 149 12.6 74 5 3
3 4 18 313 11.5 62 5 4
4 5 missing missing 14.3 56 5 5
Figure 4: Air quality dataset.

Assessing Missingness (Ozone)

Code
dat_ozone_complete = filter(:Ozone => x -> !ismissing(x), dat)
dat_ozone_missing = filter(:Ozone => x -> ismissing(x), dat)

p1 = scatter(dat_ozone_complete.Temp, dat_ozone_complete.Wind, xlabel = "Temperature (°C)", ylabel="Wind (mph)", markersize=5, color=:blue, label="Not Missing")
scatter!(dat_ozone_missing.Temp, dat_ozone_missing.Wind, markersize=5, color=:orange, label="Missing")
plot!(size=(600, 450))

p2 = scatter(dat_ozone_complete.Month, dat_ozone_complete.Day, xlabel = "Month", ylabel="Day", markersize=5, color=:blue, label="Not Missing")
scatter!(dat_ozone_missing.Month, dat_ozone_missing.Day, markersize=5, color=:orange, label="Missing")
plot!(size=(600, 450))

display(p1)
display(p2)
(a) Air quality dataset.
(b)
Figure 5

Assessing Missingness (Solar)

Code
dat_solar_complete = filter(:Solar => x -> !ismissing(x), dat)
dat_solar_missing = filter(:Solar => x -> ismissing(x), dat)

p1 = scatter(dat_solar_complete.Temp, dat_solar_complete.Wind, xlabel = "Temperature (°C)", ylabel="Wind (mph)", markersize=5, color=:blue, label="Not Missing")
scatter!(dat_solar_missing.Temp, dat_solar_missing.Wind, markersize=5, color=:orange, label="Missing")
plot!(size=(600, 450))

p2 = scatter(dat_solar_complete.Month, dat_solar_complete.Day, xlabel = "Month", ylabel="Day", markersize=5, color=:blue, label="Not Missing")
scatter!(dat_solar_missing.Month, dat_solar_missing.Day, markersize=5, color=:orange, label="Missing")
plot!(size=(600, 450))

display(p1)
display(p2)
(a) Air quality dataset.
(b)
Figure 6

Prediction with Noise

  1. Obtain bootstrap replicate of each imputation regression model.
  2. Impute by simulating from predictive distribution (including noise!).
  3. Fit target regression model to imputed dataset.
  4. Repeat for number of imputations.

Airquality Imputation (Prediction)

Impute using available data:

\[ \begin{align*} \text{Ozone} &\sim f(\text{Wind}, \text{Temp}, \text{Month}, \text{Day}) \\ \text{Solar.R} &\sim g(\text{Wind}, \text{Temp}, \text{Month}, \text{Day}) \\ \end{align*} \]

Candidate Model-Based Imputations

Code
# bootstrap linear models and get predictions
nboot = 100
## model for solar radiation
function impute_bootstrap_model(dat_complete, dat_missing, model_formula)
    idx = sample(1:nrow(dat_complete), nrow(dat_complete), replace=true)
    dat_boot = dat_complete[idx, :]
    mod = lm(model_formula, dat_boot)
    return mod
end

function impute_predict_regression(dat_complete, dat_missing, nboot, model_formula)
    impute_out = zeros(nrow(dat_missing), nboot)
    for i = 1:nboot
        mod = impute_bootstrap_model(dat_complete, dat_missing, model_formula)
        impute_out[:, i] = predict(mod, dat_missing) .+ rand(Normal(0, GLM.dispersion(mod.model)), size(impute_out)[1])
    end
    return impute_out
end

impute_solar = impute_predict_regression(dat_solar_complete, dat_solar_missing, nboot, @formula(Solar ~ Wind + Temp + Month + Day))

## model for ozone
impute_ozone = impute_predict_regression(dat_ozone_complete, dat_ozone_missing, nboot, @formula(Ozone ~ Wind + Temp + Month + Day))

# impute values into the complete-case dataset and plot
function impute_variables(dat, impute_ozone, impute_solar)
    impute = deepcopy(dat)
    impute[ismissing.(impute.Ozone), :Ozone] = round.(impute_ozone[:, 1]; digits=0)
    impute[ismissing.(impute.Solar), :Solar] = round.(impute_solar[:, 1]; digits=0)
    return impute
end
impute1 = impute_variables(dat, impute_ozone[:, 1], impute_solar[:, 1])
p1 = scatter(impute1.Solar[.!(impute1.Miss_Ozone) .& .!(impute1.Miss_Solar)], impute1.Ozone[.!(impute1.Miss_Ozone) .& .!(impute1.Miss_Solar)], color=:blue, markersize=5, xlabel=L"Solar Radiation (W/m$^2$)", ylabel="Ozone (ppb)", label="Observed")
scatter!(impute1.Solar[impute1.Miss_Ozone .| impute1.Miss_Solar], impute1.Ozone[impute1.Miss_Ozone .| impute1.Miss_Solar], color=:orange, markersize=5, label="Imputed")
plot!(size=(600, 450))

impute2 = impute_variables(dat, impute_ozone[:, 2], impute_solar[:, 2])
p2 = scatter(impute2.Solar[.!(impute2.Miss_Ozone) .& .!(impute2.Miss_Solar)], impute2.Ozone[.!(impute2.Miss_Ozone) .& .!(impute2.Miss_Solar)], color=:blue, markersize=5, xlabel=L"Solar Radiation (W/m$^2$)", ylabel="Ozone (ppb)", label="Observed")
scatter!(impute2.Solar[impute2.Miss_Ozone .| impute2.Miss_Solar], impute2.Ozone[impute2.Miss_Ozone .| impute2.Miss_Solar], color=:orange, markersize=5, label="Imputed")
plot!(size=(600, 450))

display(p1)
display(p2)
(a) Air quality dataset.
(b)
Figure 7

Model Imputed Time Series (Ozone)

Code
p1 = plot(dat.rownames, dat.Ozone, lw=3, color=:blue, label="Observations", xlabel="Day Number", ylabel="Ozone (ppb)")
for i = 1:nrow(dat_ozone_missing)
    label = i == 1 ? "Imputations" : false
    boxplot!(p1, [dat_ozone_missing[i, :rownames]], impute_ozone[i, :], color=:orange, label=label)
end
plot!(size=(1200, 500))

display(p1)
Figure 8: Air quality dataset.

Model Imputed Time Series (Solar)

Code
p2 = plot(dat.rownames, dat.Solar, lw=3, color=:blue, label="Observations", xlabel="Day Number", ylabel=L"Solar (W/m$^2$)")
for i = 1:nrow(dat_solar_missing)
    label = i == 1 ? "Imputations" : false
    boxplot!(p2, [dat_solar_missing[i, :rownames]], impute_solar[i, :], color=:orange, label=label)
end
plot!(size=(1200, 500))

display(p2)
Figure 9: Air quality dataset.

Model-Imputed Regression Coefficients

Code
β_boot = zeros(nboot)
σ_boot = zeros(nboot)
for i = 1:nboot
    dat_boot = impute_variables(dat, impute_ozone[:, i], impute_solar[:, i])
    model_boot = lm(@formula(Ozone ~ Solar), dat_boot)
    β_boot[i] = coef(model_boot)[2]
    σ_boot[i] = dispersion(model_boot.model)
end
β_est = mean(β_boot)
σ_est = sqrt(mean(σ_boot.^2) + (1 + 1/nboot) * var(β_boot))

# also get complete case estimates for later
cc_model = lm(@formula(Ozone ~ Solar), dropmissing(dat))
β_cc = coef(cc_model)[2]
σ_cc = dispersion(cc_model.model)
histogram(β_boot, xlabel=L"$\beta$", ylabel="Count", label=false)
vline!([β_cc], color=:red, label="Complete-Case")
plot!(size=(500, 450))
Figure 10: Air quality dataset.

Imputed:

  • \(\hat{\beta} = 0\.1\)
  • \(\hat{\sigma} = 31\.0\)

Complete Case:

  • \(\hat{\beta} = 0\.13\)
  • \(\hat{\sigma} = 31\.0\)

Predictive Mean Matching

  1. Obtain bootstrap replicate of each imputation regression model.
  2. Get predicted value of missing value \(\hat{y}_j\).
  3. Generate candidates by finding \(k\) nearest complete cases (minimize \(|y - \hat{y}_j|\)) or use threshold \(\eta\).
  4. Sample from candidates to get imputed value.
  5. Fit target regression model to imputed dataset.
  6. Repeat for number of imputations.

Predictive Mean Matching Imputations

Code
# bootstrap linear models and get predictions
nboot = 100 # number of bootstrap samples for parameter variabiity
k = 5 # number of nearest-neighbors to sample from

## model for solar radiation
function impute_pmm(dat_complete, dat_missing, nboot, nneighbors, model_formula, target_name)
    impute = zeros(nrow(dat_missing), nboot)
    candidates = zeros(nrow(dat_missing), nboot, nneighbors)
    for i = 1:nboot
        mod = impute_bootstrap_model(dat_complete, dat_missing, model_formula)
= predict(mod, dat_missing) # get predicted value for missing data
        y = predict(mod, dat_complete)
        for j = 1:nrow(dat_missing)
            d = abs.(y .- ŷ[j])
            sort_idx = sortperm(d)
            candidates[j, i, 1:nneighbors] = dat_complete[sort_idx[1:nneighbors], target_name]
            impute[j, i] = sample(candidates[j, i, :])
        end
    end
    return impute
end

impute_ozone_pmm = impute_pmm(dat_ozone_complete, dat_ozone_missing, 100, 5, @formula(Ozone ~ Wind + Temp + Month + Day), Symbol("Ozone"))
impute_solar_pmm = impute_pmm(dat_solar_complete, dat_solar_missing, 100, 5, @formula(Solar ~ Wind + Temp + Month + Day), Symbol("Solar"))


function impute_variables(dat, impute_ozone, impute_solar)
    impute = deepcopy(dat)
    impute[ismissing.(impute.Ozone), :Ozone] = round.(impute_ozone[:, 1]; digits=0)
    impute[ismissing.(impute.Solar), :Solar] = round.(impute_solar[:, 1]; digits=0)
    return impute
end
impute1 = impute_variables(dat, impute_ozone_pmm[:, 1], impute_solar_pmm[:, 1])
p1 = scatter(impute1.Solar[.!(impute1.Miss_Ozone) .& .!(impute1.Miss_Solar)], impute1.Ozone[.!(impute1.Miss_Ozone) .& .!(impute1.Miss_Solar)], color=:blue, markersize=5, xlabel=L"Solar Radiation (W/m$^2$)", ylabel="Ozone (ppb)", label="Observed")
scatter!(impute1.Solar[impute1.Miss_Ozone .| impute1.Miss_Solar], impute1.Ozone[impute1.Miss_Ozone .| impute1.Miss_Solar], color=:orange, markersize=5, label="Imputed")
plot!(size=(600, 450))

impute2 = impute_variables(dat, impute_ozone_pmm[:, 2], impute_solar_pmm[:, 2])
p2 = scatter(impute2.Solar[.!(impute2.Miss_Ozone) .& .!(impute2.Miss_Solar)], impute2.Ozone[.!(impute2.Miss_Ozone) .& .!(impute2.Miss_Solar)], color=:blue, markersize=5, xlabel=L"Solar Radiation (W/m$^2$)", ylabel="Ozone (ppb)", label="Observed")
scatter!(impute2.Solar[impute2.Miss_Ozone .| impute2.Miss_Solar], impute2.Ozone[impute2.Miss_Ozone .| impute2.Miss_Solar], color=:orange, markersize=5, label="Imputed")
plot!(size=(600, 450))

display(p1)
display(p2)
(a) Air quality dataset.
(b)
Figure 11

PMM Imputed Time Series (Ozone)

Code
p1mm = plot(dat.rownames, dat.Ozone, lw=3, color=:blue, label="Observations", xlabel="Day Number", ylabel="Ozone (ppb)")
for i = 1:nrow(dat_ozone_missing)
    label = i == 1 ? "Imputations" : false
    boxplot!(p1mm, [dat_ozone_missing[i, :rownames]], impute_ozone_pmm[i, :], color=:orange, label=label)
end
plot!(size=(1200, 500))

display(p1mm)
Figure 12: Air quality dataset.

PMM Imputed Time Series (Solar)

Code
p2mm = plot(dat.rownames, dat.Solar, lw=3, color=:blue, label="Observations", xlabel="Day Number", ylabel=L"Solar (W/m$^2$)")
for i = 1:nrow(dat_solar_missing)
    label = i == 1 ? "Imputations" : false
    boxplot!(p2mm, [dat_solar_missing[i, :rownames]], impute_solar_pmm[i, :], color=:orange, label=label)
end
plot!(size=(1200, 500))

display(p2mm)
Figure 13: Air quality dataset.

PMM-Imputed Regression Coefficients

Code
β_boot = zeros(nboot)
σ_boot = zeros(nboot)
for i = 1:nboot
    dat_boot = impute_variables(dat, impute_ozone[:, i], impute_solar[:, i])
    model_boot = lm(@formula(Ozone ~ Solar), dat_boot)
    β_boot[i] = coef(model_boot)[2]
    σ_boot[i] = dispersion(model_boot.model)
end
β_est = mean(β_boot)
σ_est = sqrt(mean(σ_boot.^2) + (1 + 1/nboot) * var(β_boot))

# also get complete case estimates for later
cc_model = lm(@formula(Ozone ~ Solar), dropmissing(dat))
β_cc = coef(cc_model)[2]
σ_cc = dispersion(cc_model.model)
histogram(β_boot, xlabel=L"$\beta$", ylabel="Count", label=false)
vline!([β_cc], color=:red, label="Complete-Case")
plot!(size=(500, 450))
Figure 14: Air quality dataset.

Imputed:

  • \(\hat{\beta} = 0\.1\)
  • \(\hat{\sigma} = 31\.0\)

Complete Case:

  • \(\hat{\beta} = 0\.13\)
  • \(\hat{\sigma} = 31\.0\)

Comparison of Imputations (Solar)

Code
impute_model_df = DataFrame(impute_solar', :auto)
impute_model_df_stk = stack(impute_model_df)
impute_model_df_stk.Method .= "Prediction"

impute_pmm_df = DataFrame(impute_solar_pmm', :auto)
impute_pmm_df_stk = stack(impute_pmm_df)
impute_pmm_df_stk.Method .= "PMM"

impute_all_df = vcat(impute_model_df_stk, impute_pmm_df_stk)

@df impute_all_df groupedboxplot(:variable, :value, group=:Method, xlabel="Imputed Case", ylabel=L"Solar Radiation (W/m$^2$)")
plot!(size=(1300, 500))
Figure 15: Air quality dataset.

Comparison of Imputations (Ozone)

Code
impute_model_df = DataFrame(impute_ozone', :auto)
impute_model_df_stk = stack(impute_model_df)
impute_model_df_stk.Method .= "Prediction"

impute_pmm_df = DataFrame(impute_ozone_pmm', :auto)
impute_pmm_df_stk = stack(impute_pmm_df)
impute_pmm_df_stk.Method .= "PMM"

impute_all_df = vcat(impute_model_df_stk, impute_pmm_df_stk)

@df impute_all_df groupedboxplot(:variable, :value, group=:Method, xlabel="Imputed Case", ylabel="Ozone (ppb)")
plot!(size=(1300, 500))
Figure 16: Air quality dataset.

Key Points

Key Points

  • Use as much information as possible when conducting multiple imputation.
  • Incorporate as much uncertainty as possible to avoid biasing downstream results: we don’t know what the missing data looks like!

Class Review

Inference and Description Are Linked

  • Knowing what is important to describe requires a model of data generation;
  • Doing meaningful inference requires a model of data generation.

Spidermen Meme

Themes of This Class

  • Probability theory helps us deduce logical implications of theories conditional on our assumptions
  • Cannot use an “objective” procedure to avoid subjective responsibility
  • Vaguely motivated procedures give vague or misleading results

Bart Statistics Meme

Data Generation Approximates Reality

Estimand Estimator Cake

Estimand Estimator Cake

Estimate Cake

Source: Richard McElreath

Class Review

timeline
      Introduction: Overview
                  : Hypothesis Testing and Scientific Inference
      Probability Fundamentals: Prob/Stats "Review"
                              : Modeling Data-Generating Processes
                              : Bayesian Statistics
                              : Model-Data Discrepancy
                              : Autocorrelated Residuals
      Simulation Methods: Monte Carlo
                        : Bootstrap
                        : MCMC
      Model Evaluation: Cross-Validation
                      : Model Selection
      Useful Extras: Extreme Values
                    : Missing Data
                    

Workflow Covered In Class

  1. Exploratory Analysis
  2. Develop candidate model(s) and calibrate.
  3. Simulate from models to assess implications.
  4. Compare evidence for models with scoring rules/information criteria.

Rock Paper Scissors meme

What Might Come Next

  • More advanced probability models (e.g. mixtures)
  • Spatial models
  • Experimental design (confounds)
  • (Probabilistic) Machine learning

Upcoming Schedule

Wednesday: No Class

Next Week + 5/5: Project Presentations

Assessments

HW5: due 5/2.

Literature Critique: Due 5/2.

References