Let \(M_Y\) be the indicator function for whether \(Y\) is missing and let \(\pi(x) = \mathbb{P}(M_Y = 0 | X = x)\) be the inclusion probability.
Goal: Understand the complete-data distribution \(\mathbb{P}(Y=y | X=x)\).
But we only observe the component \(\mathbb{P}(Y = y | X=x, M_Y = 0)\, \pi(x)\) of the decomposition \[\mathbb{P}(Y=y | X=x) = \mathbb{P}(Y = y | X=x, M_Y = 0)\, \pi(x) + \mathbb{P}(Y = y | X=x, M_Y = 1)\, (1-\pi(x));\] the term \(\mathbb{P}(Y = y | X=x, M_Y = 1)\, (1-\pi(x))\) is missing.
Categories of Missingness
Missing Completely At Random (MCAR)
MCAR: \(M_Y\) is independent of \(X\) and \(Y\).
Complete cases are fully representative of the complete data: \[\mathbb{P}(Y=y | X=x, M_Y = 0) = \mathbb{P}(Y=y | X=x).\]
MCAR: Strong but generally implausible. Complete-case analysis is fine, since the observed data are fully representative.
MAR (\(M_Y\) independent of \(Y\) given \(X\)): More plausible than MCAR; complete-case analysis can still be justified, since the conditional observed distributions are unbiased estimates of the conditional complete distributions.
MNAR: Deletion is a bad idea, as the observed data do not follow the same conditional distribution. Missingness can be informative: try to model the missingness mechanism.
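The contrast can be seen in a small simulation; the data distribution, missingness rates, and MNAR mechanism below are all hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=10.0, scale=2.0, size=100_000)

# MCAR: every value has the same 30% chance of being missing.
mcar_miss = rng.random(y.size) < 0.3
# MNAR: larger values of y are more likely to be missing.
mnar_miss = rng.random(y.size) < 1.0 / (1.0 + np.exp(-(y - 10.0)))

mean_full = y.mean()
mean_mcar = y[~mcar_miss].mean()  # close to the full-data mean
mean_mnar = y[~mnar_miss].mean()  # systematically too low
```

Under MCAR the complete-case mean is unbiased; under the MNAR mechanism it is pulled well below the true mean.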
Checking Assumptions About Missingness
Checking MCAR
In general, we can’t know for sure if missingness \(M_Y\) is informative about \(Y\) (since we can’t see it!).
But we can check if \(M_Y\) is independent of \(X\): if not, reject MCAR.
Can we conclude MCAR if, in our dataset, \(M_Y\) appears independent of \(X\)?
No: the data tell us nothing about \(\mathbb{P}(Y=y | X=x, M_Y=1)\). We need to bring to bear our understanding of the data-collection process.
Instead, try a few different models reflecting different assumptions about missingness: do your conclusions change?
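The first check above, whether \(M_Y\) depends on \(X\), can be sketched with a contingency-table test; the covariate and missingness rates here are hypothetical:

```python
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(1)
n = 20_000
x = rng.integers(0, 2, size=n)                  # binary covariate
m = rng.random(n) < np.where(x == 1, 0.4, 0.1)  # missingness depends on x

# 2x2 table of (missing, observed) counts by group
table = np.array([[np.sum(m & (x == 0)), np.sum(~m & (x == 0))],
                  [np.sum(m & (x == 1)), np.sum(~m & (x == 1))]])
chi2, p, dof, expected = chi2_contingency(table)
# a tiny p-value means M_Y depends on X, so MCAR is rejected
```

A small p-value rejects MCAR; a large one does not confirm it, for the reasons above.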
Methods for Dealing with Missing Data
Imputation: substitute values for missing data before analysis;
Averaging: find expected values over all possible values of the missing variables.
Imputation
Imputation does not create “new” information, it reuses existing information to allow the use of standard procedures.
Example: Missing observations in a time series, where we want to insert values to fit an AR(1) model or estimate the autocorrelation using “simple” estimators.
As a result, it’s convenient but can create systematic distortions.
Imputation Under MCAR
Impute from the marginal distribution (parametrically or non-parametrically), \[p(Y_\text{miss}) = p(Y_\text{obs}).\] This can create distortions if meaningful relationships are neglected.
Impute using a regression model (such as linear imputation). This generalizes relationships but requires that missingness be uninformative about \(Y\).
Imputation Under MAR
Impute from the conditional distribution, \[p(Y_\text{miss} | X = x) = p(Y_\text{obs} | X = x).\] Can be done parametrically or non-parametrically.
Impute using matching: find the case with the closest predictor values and copy its value of \(Y\). Can work okay or be a terrible idea.
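A minimal sketch of conditional-mean imputation under MAR, with hypothetical data where missingness depends only on the predictor \(x\):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(scale=0.5, size=n)
miss = rng.random(n) < np.where(x > 0, 0.5, 0.1)  # MAR: depends on x only

obs = ~miss
slope, intercept = np.polyfit(x[obs], y[obs], deg=1)  # fit on complete cases

y_imp = y.copy()
y_imp[miss] = intercept + slope * x[miss]  # impute conditional means

# the complete-case mean is biased; the imputed mean is not
mean_cc = y[obs].mean()
mean_imp = y_imp.mean()
```

Note that imputing conditional means understates the variability of \(Y\), which is part of the motivation for multiple imputation.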
Imputation Under MNAR
Need to model missingness mechanism (censoring, etc).
Often need to make assumptions about how the relationship extrapolates.
Model relationship between predictors and missing data;
Add unknown constant to imputed data to reflect biases.
Ultimately, MNAR requires a sensitivity analysis.
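One simple sensitivity analysis is delta adjustment: impute, then shift the imputed values by a range of offsets and see whether conclusions change. The data and offsets below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(4)
y = rng.normal(loc=10.0, scale=2.0, size=5_000)
miss = rng.random(y.size) < 0.3

means = {}
for delta in (0.0, 0.5, 1.0, 2.0):
    y_imp = y.copy()
    y_imp[miss] = y[~miss].mean() + delta  # mean imputation plus shift
    means[delta] = y_imp.mean()
# if conclusions are stable across plausible deltas,
# they are less sensitive to the MNAR assumption
```

Each offset encodes a different assumption about how far the missing values deviate from the observed ones.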
Multiple Imputation
Imputing one value for a missing datum cannot be correct in general, because we don’t know what value to impute with certainty (if we did, it wouldn’t be missing).
Generate \(m\) imputations \(\hat{y}_i\) by sampling missing values;
Estimate statistics \(\hat{t}_i\) (with standard errors \(\hat{\sigma}_i\)) for each imputation;
Pool \(\{\hat{t}_i\}\) and estimate \[\bar{t} = \frac{1}{m} \sum_{i=1}^m \hat{t}_i, \qquad \bar{\sigma}^2 = \frac{1}{m}\sum_{i=1}^m \hat{\sigma}_i^2 + \left(1 + \frac{1}{m}\right) \frac{1}{m-1} \sum_{i=1}^m \left(\hat{t}_i - \bar{t}\right)^2,\] where the first term is the within-imputation variance and the second the between-imputation variance.
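The pooling step (Rubin's rules) is a few lines of code; `pool` is a hypothetical helper name:

```python
import numpy as np

def pool(estimates, variances):
    """Pool m imputation-specific estimates t_i and their variances
    sigma_i^2 into one estimate and total variance (Rubin's rules)."""
    t = np.asarray(estimates, dtype=float)
    s2 = np.asarray(variances, dtype=float)
    m = len(t)
    t_bar = t.mean()              # pooled point estimate
    within = s2.mean()            # average within-imputation variance
    between = t.var(ddof=1)       # between-imputation variance
    total = within + (1.0 + 1.0 / m) * between
    return t_bar, total
```

For example, `pool([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])` gives a pooled estimate of 2.0 and total variance \(0.5 + \frac{4}{3}\).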
Methods for Multiple Imputation
Prediction with noise: Fit a regression model and add noise to expected value. Better to use the bootstrap to also include parameter uncertainty.
Predictive mean matching: Sample missing values from cases with close values of predictors.
In both cases, important to include as much information as possible in the imputation model!
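Predictive mean matching can be sketched as follows; `pmm_impute` is a hypothetical helper, using a linear model for the predicted means:

```python
import numpy as np

def pmm_impute(x, y, miss, k=5, rng=None):
    """Predictive mean matching (sketch): for each missing case, find
    the k observed cases with the closest predicted values and copy
    the observed y of a randomly chosen donor."""
    rng = rng if rng is not None else np.random.default_rng()
    obs = ~miss
    slope, intercept = np.polyfit(x[obs], y[obs], deg=1)
    pred = intercept + slope * x
    y_obs, pred_obs = y[obs], pred[obs]
    y_imp = y.copy()
    for i in np.where(miss)[0]:
        donors = np.argsort(np.abs(pred_obs - pred[i]))[:k]  # k nearest donors
        y_imp[i] = y_obs[donors[rng.integers(k)]]
    return y_imp
```

Because imputed values are copied from observed cases, they are always realistic values of \(Y\), and the random donor choice supplies the between-imputation variation needed for multiple imputation.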
Multiple Imputation Models
No need to be limited to linear regression!
Classification and regression trees are very common (random forests are probably better for additional variation);
Could set up time-specific models.
Bayesian Imputation
Bayesian imputation involves putting a prior over the missing values and treating them as model parameters, resulting in a joint posterior over imputed values and parameters: \[p(\theta, y_\text{miss} | y_\text{obs}) \propto p(y_\text{obs}, y_\text{miss} | \theta)\, p(\theta).\]
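A toy Gibbs sampler illustrates the idea; all modeling choices here (normal data with known unit variance, flat prior on \(\mu\), MCAR missingness) are assumptions of this sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
y_obs = rng.normal(5.0, 1.0, size=200)  # observed data
n_miss = 50                             # number of missing values

# Alternate draws of the missing values given mu and of mu given the
# completed data; the chain targets the joint posterior over (mu, y_miss).
mu = y_obs.mean()
mu_draws = []
for _ in range(2_000):
    y_miss = rng.normal(mu, 1.0, size=n_miss)                 # y_miss | mu
    y_all = np.concatenate([y_obs, y_miss])
    mu = rng.normal(y_all.mean(), 1.0 / np.sqrt(y_all.size))  # mu | y
    mu_draws.append(mu)
```

The retained draws of \(\mu\) concentrate around the observed-data mean, and the draws of \(y_\text{miss}\) are multiple imputations automatically.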
Missing data is very common in environmental contexts.
Ability to draw unbiased inferences depends on whether the missingness is MCAR, MAR, or MNAR (informative).
Best approach to missing data is to not have any.
Otherwise, try multiple imputation based on understanding/theories of missing mechanisms. Use as much data as possible in these models.
Upcoming Schedule
Monday: Mixture Models and Model-Based Clustering or Gaussian Processes and Emulation.
Next Wednesday (4/23): No Class
Assessments
HW5 released, due 5/2.
Literature Critique: Due 5/2.
Project Presentations: 4/28, 4/30, 5/5.
References
Little, R. J. (2013). In praise of simplicity not mathematistry! Ten simple powerful ideas for the statistical scientist. J. Am. Stat. Assoc., 108, 359–369. https://doi.org/10.1080/01621459.2013.787932
Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. New York: John Wiley & Sons. https://doi.org/10.1002/9780470316696