Missing Data


Lecture 21

April 16, 2025

Review

Extreme Values

Values with a very low probability of occurring; not necessarily high-impact events (which don’t have to be rare!). Two common framings (sketched in code below):

  1. “Block” extremes, e.g. annual maxima (block maxima)
  2. Values which exceed a certain threshold (peaks over threshold)
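As a quick sketch (not from the lecture; the three “years” and the threshold of 2.0 are arbitrary choices), both framings on simulated daily data:

Code
using Random
Random.seed!(1)
vals = randn(3 * 365)                       # three hypothetical years of daily values
years = repeat(1:3, inner=365)
block_maxima = [maximum(vals[years .== yr]) for yr in 1:3]  # (1) annual maxima
exceedances = vals[vals .> 2.0]             # (2) peaks over a threshold of 2.0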

Missing Data

Statistics and Missing Data

…Statistics is a missing data problem.

Little (2013)

Missing vs. Latent Variables

Missing Data: Variables which are inconsistently observed.

Latent Variables: Unobserved variables which influence the data-generating process.

In both cases, we would like to understand the complete-data likelihood (data-generating process including the missing/latent data).

Common (But Flawed) Approach: Complete-Case Analysis

Complete-case Analysis: Only consider data for which all variables are available (see the sketch below).

  • Can result in bias if the missing values have a systematic pattern.
  • Could result in discarding a large amount of data.
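In practice, complete-case analysis amounts to dropping incomplete rows, e.g. (a minimal sketch with hypothetical data):

Code
using DataFrames
df = DataFrame(x=[1.0, 2.0, 3.0], y=[0.5, missing, 0.9])
complete = dropmissing(df)   # discards row 2 entirely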

Importance of Assumptions

Because we don’t observe the missing data, all approaches require assumptions about how the missing data might have looked.

Examples:

  • Whether data is missing is entirely random;
  • Data can be linearly inter-/extrapolated.

Missing Data Example

Code
# Packages assumed for the running example (imports not shown in the original)
using Distributions, Plots, LaTeXStrings

n = 50
x = rand(Uniform(0, 100), n)
logit(x) = log(x / (1 - x))
invlogit(x) = exp(x) / (1 + exp(x))
f(x) = invlogit(0.05 * (x - 50) + rand(Normal(0, 1)))  # noisy logistic response
y = f.(x)

m(y) = invlogit(0.75 * logit(y))  # missingness probability depends on y itself
prob_missing_y = m.(y)
missing_y = Bool.(rand.(Binomial.(1, prob_missing_y)))
xobs = x[.!(missing_y)]
yobs = y[.!(missing_y)]

p_alldat = scatter(xobs, yobs, xlabel=L"$x$", ylabel=L"$y$", label="Observations", markersize=5, size=(600, 500), ylims=(-0.05, 1))
scatter!(x[missing_y], zeros(sum(missing_y)), markershape=:x, markersize=3, label="Missing Observations")
Figure 1: Missing Data Running Example
Code
using GLM  # for lm and predict (assumed; import not shown in the original)

# Linear imputation: fit a regression on the observed cases, predict at the missing x's
linfit = lm([ones(length(xobs)) xobs], yobs)
linpred = predict(linfit, [ones(sum(missing_y)) x[missing_y]])
p1 = deepcopy(p_alldat)
scatter!(p1, x[missing_y], linpred, label="Imputed Values", markersize=5, markershape=:diamond, legend=false)
Figure 2: Missing Data Running Example

Missing Data Example

Code
p_alldat
Figure 3: Missing Data Running Example
Code
using StatsBase  # for sample (assumed; import not shown in the original)

# Marginal imputation: resample observed y values at random
repeatpred = sample(yobs, sum(missing_y), replace=true)
p2 = deepcopy(p_alldat)
scatter!(p2, x[missing_y], repeatpred, label="Imputed Values", markersize=5, markershape=:diamond, legend=false)
Figure 4: Missing Data Running Example

Missing Data Example

Code
p_alldat
Figure 5: Missing Data Running Example
Code
p3 = deepcopy(p_alldat)
# For comparison: the true (unobserved) y values at the missing x's
scatter!(p3, x[missing_y], y[missing_y], label="True Values", markersize=5, markershape=:diamond, legend=false)
Figure 6: Missing Data Running Example

Comparison of Imputations

Notation

Let \(M_Y\) be the indicator function for whether \(Y\) is missing and let \(\pi(x) = \mathbb{P}(M_Y = 0 | X = x)\) be the inclusion probability.

Goal: Understand the complete-data distribution \(\mathbb{P}(Y=y | X=x)\).

But we only observe the component \(\mathbb{P}(Y = y | X=x, M_Y = 0) \pi(x)\); we are missing \(\mathbb{P}(Y = y | X=x, M_Y = 1) (1-\pi(x))\).
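Spelling out the decomposition (by the law of total probability), the complete-data distribution is a mixture of the observed and missing components:

\[\mathbb{P}(Y=y | X=x) = \mathbb{P}(Y=y | X=x, M_Y=0) \pi(x) + \mathbb{P}(Y=y | X=x, M_Y=1) (1-\pi(x))\]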

Categories of Missingness

Missingness Completely At Random (MCAR)

MCAR: \(M_Y\) is independent of both \(X\) and \(Y\).

Complete cases are fully representative of the complete data:

\[\mathbb{P}(Y=y) = \mathbb{P}(Y=y | M_Y=0)\]

Code
flin(x) = 0.25 * x + 2 + rand(Normal(0, 7))  # linear trend plus noise
y = flin.(x)
xpred = collect(0:0.1:100)

# Reference fit using the complete data
lm_all = lm([ones(length(x)) x], y)
y_lm_all = predict(lm_all, [ones(length(xpred)) xpred])

# MCAR: every observation is missing with the same probability (0.25)
missing_y = Bool.(rand(Binomial(1, 0.25), length(y)))
xobs = x[.!(missing_y)]
yobs = y[.!(missing_y)]
lm_mcar = lm([ones(n - sum(missing_y)) xobs], yobs)
y_lm_mcar = predict(lm_mcar, [ones(length(xpred)) xpred])


p1 = scatter(xobs, yobs, xlabel=L"$x$", ylabel=L"$y$", label=false, markersize=5, size=(600, 500), color=:blue)
scatter!(x[missing_y], y[missing_y], alpha=0.9, color=:lightgrey, label=false, markersize=5)
plot!(xpred, y_lm_all, color=:red, lw=3, label="Complete-Data Inference", ribbon=GLM.dispersion(lm_all), fillalpha=0.2)
plot!(xpred, y_lm_mcar, color=:blue, lw=3, linestyle=:dot, label="Observed-Data Inference", ribbon=GLM.dispersion(lm_mcar), fillalpha=0.2)
Figure 7: Illustration of MCAR Data

Missingness At Random (MAR)

MAR: \(M_Y\) is independent of \(Y\) conditional on \(X\).

Also called ignorable or uninformative missingness.

\[ \begin{align*} \mathbb{P}&(Y=y | X=x) \\ &= \mathbb{P}(Y=y | X=x, M_Y=0) \end{align*} \]

Code
# MAR: missingness probability depends only on x, not on y
missing_y = Bool.(rand.(Binomial.(1, invlogit.(0.1 * (x .- 75)))))
xobs = x[.!(missing_y)]
yobs = y[.!(missing_y)]
lm_mar = lm([ones(n - sum(missing_y)) xobs], yobs)
y_lm_mar = predict(lm_mar, [ones(length(xpred)) xpred])

p2 = scatter(xobs, yobs, xlabel=L"$x$", ylabel=L"$y$", label=false, markersize=5, size=(600, 500), color=:blue)
scatter!(x[missing_y], y[missing_y], alpha=0.9, color=:lightgrey, label=false, markersize=5)
plot!(xpred, y_lm_all, color=:red, lw=3, label="Complete-Data Inference", ribbon=GLM.dispersion(lm_all), fillalpha=0.2)
plot!(xpred, y_lm_mar, color=:blue, lw=3, linestyle=:dot, label="Observed-Data Inference", ribbon=GLM.dispersion(lm_mar), fillalpha=0.2)
Figure 8: Illustration of MAR Data

Missingness Not-At-Random (MNAR)

MNAR: \(M_Y\) depends on \(Y\) (and/or on unmodeled variables).

Also called non-ignorable or informative missingness.

\[ \begin{align*} \mathbb{P}&(Y=y | X=x) \\ &\neq \mathbb{P}(Y=y | X=x, M_Y=0) \end{align*} \]

Code
# MNAR: missingness probability depends on y itself
missing_y = Bool.(rand.(Binomial.(1, invlogit.(0.9 * (y .- 15)))))
xobs = x[.!(missing_y)]
yobs = y[.!(missing_y)]
lm_mnar = lm([ones(n - sum(missing_y)) xobs], yobs)
y_lm_mnar = predict(lm_mnar, [ones(length(xpred)) xpred])

p2 = scatter(xobs, yobs, xlabel=L"$x$", ylabel=L"$y$", label=false, markersize=5, size=(600, 500), color=:blue)
scatter!(x[missing_y], y[missing_y], alpha=0.9, color=:lightgrey, label=false, markersize=5)
plot!(xpred, y_lm_all, color=:red, lw=3, label="Complete-Data Inference", ribbon=GLM.dispersion(lm_all), fillalpha=0.2)
plot!(xpred, y_lm_mnar, color=:blue, lw=3, linestyle=:dot, label="Observed-Data Inference", ribbon=GLM.dispersion(lm_mnar), fillalpha=0.2)
Figure 9: Illustration of MNAR Data

Implications of Missingness Mechanism

  1. MCAR: Strong assumption, but generally implausible. Can simply use complete cases, as the observed data are fully representative.
  2. MAR: More plausible than MCAR; can still justify complete-case analysis, as the conditional observed distributions are unbiased estimates of the conditional complete-data distributions.
  3. MNAR: Deletion is a bad idea, as the observed data do not follow the same conditional distribution. Missingness can be informative: try to model the missingness mechanism.

Checking Assumptions About Missingness

Checking MCAR

In general, we can’t know for sure if missingness \(M_Y\) is informative about \(Y\) (since we can’t see it!).

But we can check if \(M_Y\) is independent of \(X\): if not, reject MCAR.

Can we conclude MCAR if, in our dataset, \(M_Y\) appears independent of \(X\)?
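One way to operationalize this check (a sketch reusing the running example’s variables, not the lecture’s code) is to regress the missingness indicator on the covariates:

Code
using GLM, DataFrames
mcar_df = DataFrame(m=missing_y, x=x)
mcar_check = glm(@formula(m ~ x), mcar_df, Binomial(), LogitLink())
coeftable(mcar_check)   # a clearly nonzero slope on x is evidence against MCAR

Even a flat relationship only fails to reject MCAR; it cannot confirm it, since \(M_Y\) could still depend on \(Y\).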

Distinguishing MAR from MNAR

Can’t do this statistically!

MAR: \(\mathbb{P}(Y=y | X=x, M_Y=1) = \mathbb{P}(Y=y | X=x, M_Y=0)\)

But the data tell us nothing about \(\mathbb{P}(Y=y | X=x, M_Y=1)\). We need to bring to bear an understanding of the data-collection process.

Instead, try a few different models reflecting different assumptions about missingness: do your conclusions change?

Methods for Dealing with Missing Data

Methods for Dealing with Missing Data

  1. Imputation: substitute values for missing data before analysis;
  2. Averaging: find expected values over all possible values of the missing variables.

Imputation

Imputation does not create “new” information; it reuses existing information to allow the use of standard procedures.

Example: Missing observations in a time series, want to insert values to fit AR(1) model or estimate autocorrelation using “simple” estimators.

As a result, it’s convenient but can create systematic distortions.
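A concrete (hypothetical) version of the time-series example: linearly interpolate a gap, then estimate the lag-1 autocorrelation with the simple estimator. The interpolated stretch is artificially smooth, which can distort the estimate:

Code
using Random, Statistics
Random.seed!(1)
T = 200
z = zeros(T)
for t in 2:T
    z[t] = 0.7 * z[t-1] + randn()   # simulate an AR(1) series with ϕ = 0.7
end
zimp = copy(z)
gap = 80:90                          # pretend this block of values is missing
lo, hi = first(gap) - 1, last(gap) + 1
for (k, t) in enumerate(gap)
    zimp[t] = z[lo] + (z[hi] - z[lo]) * k / (hi - lo)  # linear interpolation
end
ρ̂ = cor(zimp[1:end-1], zimp[2:end])  # lag-1 autocorrelation after imputation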

Imputation Under MAR

  • Impute from the marginal distribution (parametrically or non-parametrically), \[p(Y_\text{miss}) = p(Y_\text{obs}).\] This can create distortions if meaningful relationships are neglected.
  • Impute using a regression model (such as linear imputation). This generalizes relationships but requires that the missingness be uninformative about \(Y\).

Imputation Under MAR

  • Impute from the conditional distribution, \[p(Y_\text{miss} | X = x) = p(Y_\text{obs} | X = x).\] Can be done parametrically or non-parametrically.
  • Impute using matching: find the observed case with the closest predictor value and copy its value of \(Y\) (see the sketch below). Can work okay or be a terrible idea.
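A minimal matching sketch using the running example’s variables: for each missing case, copy the \(y\) of the observed case with the closest \(x\):

Code
match_idx = [argmin(abs.(xobs .- xm)) for xm in x[missing_y]]
y_matched = yobs[match_idx]   # imputed values copied from the nearest neighbors

With a single donor per case this understates variability; multiple-imputation variants instead sample among the k nearest donors.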

Imputation Under MNAR

Need to model the missingness mechanism (censoring, etc.).

Often need to make assumptions about how the relationship extrapolates.

  • Model relationship between predictors and missing data;
  • Add unknown constant to imputed data to reflect biases.

Ultimately, MNAR requires a sensitivity analysis, as sketched below.
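One simple sensitivity sketch (the offsets δ are illustrative assumptions, not from the lecture): impute under MAR, shift the imputed values by a range of offsets, and check whether the fitted slope changes materially:

Code
using GLM
fit_obs = lm([ones(length(xobs)) xobs], yobs)
y_imp = predict(fit_obs, [ones(sum(missing_y)) x[missing_y]])
for δ in (-0.2, -0.1, 0.0, 0.1, 0.2)   # assumed range of MNAR offsets
    yfull = [yobs; y_imp .+ δ]
    xfull = [xobs; x[missing_y]]
    slope = coef(lm([ones(length(xfull)) xfull], yfull))[2]
    println("δ = $δ: slope = $(round(slope, digits=3))")
end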

Multiple Imputation

Imputing one value for a missing datum cannot be correct in general, because we don’t know what value to impute with certainty (if we did, it wouldn’t be missing).

Rubin (1987)

Multiple Imputation Steps

  1. Generate \(m\) imputations \(\hat{y}_i\) by sampling missing values;
  2. Estimate statistics \(\hat{t}_i\) (and their variances \(\hat{\sigma}_i^2\)) for each imputation;
  3. Pool \(\{\hat{t}_i\}\) and estimate (see the pooling sketch below) \[\bar{t} = \frac{1}{m} \sum_i \hat{t}_i\] \[\bar{\sigma}^2 = \frac{1}{m}\sum_{i=1}^m \hat{\sigma}_i^2 + \left(1 + \frac{1}{m}\right) \text{Var}(\hat{t}_i)\]
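A minimal sketch of the pooling step, assuming the per-imputation estimates and variances are already in hand (the numbers here are hypothetical):

Code
using Statistics
t̂ = [0.51, 0.48, 0.55, 0.47, 0.52]        # hypothetical per-imputation estimates
σ̂² = [0.040, 0.050, 0.045, 0.060, 0.055]  # hypothetical within-imputation variances
m = length(t̂)
t̄ = mean(t̂)                    # pooled point estimate
W = mean(σ̂²)                   # average within-imputation variance
B = var(t̂)                     # between-imputation variance
total_var = W + (1 + 1/m) * B   # Rubin's total variance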

Methods for Multiple Imputation

  1. Prediction with noise: Fit a regression model and add noise to the expected value. Better to use the bootstrap to also include parameter uncertainty.
  2. Predictive mean matching: Sample missing values from observed cases whose predicted means are close (sketched below).

In both cases, important to include as much information as possible in the imputation model!
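A rough predictive-mean-matching sketch on the running example (the donor count k = 5 is an arbitrary choice):

Code
using GLM, StatsBase
fit_pmm = lm([ones(length(xobs)) xobs], yobs)
ŷ_obs = predict(fit_pmm)                                       # fitted means, observed cases
ŷ_mis = predict(fit_pmm, [ones(sum(missing_y)) x[missing_y]])  # predicted means, missing cases
k = 5
imputed = map(ŷ_mis) do ŷ
    donors = partialsortperm(abs.(ŷ_obs .- ŷ), 1:k)  # k observed cases with the closest means
    yobs[sample(donors)]                             # copy one donor's y at random
end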

Multiple Imputation Models

No need to be limited to linear regression!

  • Classification and regression trees (CART) are very common (random forests are probably better for additional variation);
  • Could set up time-specific models.

Bayesian Imputation

Bayesian imputation involves putting a prior over the missing values and treating them as model parameters, resulting in a joint distribution of imputed values and parameters:

\[p(\theta, y_\text{miss} | Y=y_\text{obs}, X=x)\]
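A sketch of what this can look like with Turing.jl, assuming a simple linear model (this is not the lecture’s code; Turing treats missing entries of \(y\) as parameters to sample):

Code
using Turing

@model function linimpute(x, y)
    α ~ Normal(0, 10)
    β ~ Normal(0, 10)
    σ ~ truncated(Normal(0, 5); lower=0)
    for i in eachindex(y)
        y[i] ~ Normal(α + β * x[i], σ)   # missing y[i] are sampled as parameters
    end
end

yb = convert(Vector{Union{Missing,Float64}}, copy(y))
yb[missing_y] .= missing   # mark the missing observations
chain = sample(linimpute(x, yb), NUTS(), 1000)

The posterior samples of the missing \(y\) then play the role of multiple imputations, with parameter uncertainty propagated automatically.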

Key Points and Upcoming Schedule

Key Points

  • Missing data is very common in environmental contexts.
  • The ability to draw unbiased inferences depends on whether the missingness is MCAR, MAR, or MNAR (informative).
  • Best approach to missing data is to not have any.
  • Otherwise, try multiple imputation based on an understanding of (or theories about) the missingness mechanism. Use as much data as possible in these models.

Upcoming Schedule

Monday: Mixture Models and Model-Based Clustering or Gaussian Processes and Emulation.

Next Wednesday (4/23): No Class

Assessments

HW5 released, due 5/2.

Literature Critique: Due 5/2.

Project Presentations: 4/28, 4/30, 5/5.

References

Little, R. J. (2013). In praise of simplicity not mathematistry! Ten simple powerful ideas for the statistical scientist. J. Am. Stat. Assoc., 108, 359–369. https://doi.org/10.1080/01621459.2013.787932
Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons. https://doi.org/10.1002/9780470316696