Chapter 11 Handling missing values

11.1 Lecture Missing values

11.1.1 Introduction

The problem of missing values has existed since the earliest attempts to exploit data as a source of knowledge, as it lies intrinsically in the process of obtaining, recording, and preparing the data itself. Clearly, citing Gertrude Mary Cox, “The best thing to do with missing values is not to have any”, but in the contemporary world, with the ever-growing amount of accessible data and the demand for statistical justification, this is rarely, if ever, the case. Main references on missing values include Schafer (1997), Little and Rubin (1987, 2002), van Buuren (2012), Carpenter and Kenward (2013) and Gelman and Hill (2007, chap. 25).

Missing values occur for plenty of reasons: machines that fail, individuals who forget or do not want to answer some questions of a questionnaire, damaged plants, etc. They are problematic since most statistical methods cannot be applied to an incomplete dataset. In this chapter we review the different types of missing data and the statistical methods which allow their incorporation.

11.1.1.1 MCAR, MAR, MNAR

There are several types of missing data, and explaining the reasons why part of the data is missing is crucial to perform inference or any kind of statistical analysis. Dealing with missing data boils down to considering that the observed data \(X_{\text{OBS}}\) is only a subset of a complete data model \(X = (X_{\text{OBS}},X_{\text{MIS}})\) which is not fully observable (i.e. \(X_{\text{MIS}}\) are the missing data). Assume \(X = (X_1,\ldots,X_n)\); the missing values \(X_{\text{MIS}}\) are characterized by a set of indices \(I_{\text{MIS}}\subset \{1,\ldots,n\}\) such that \(X_{\text{MIS}} = \{X_i; i\in I_{\text{MIS}}\}\). We define the indicator of missingness \(M \in \{0,1\}^n\) such that \(M_i = 1\) if \(i\in I_{\text{MIS}}\) and \(M_i = 0\) otherwise; \(M\) defines the pattern of missingness. Both \(X\) and \(M\) are modeled as random variables with probability distributions \(\mathbb{P}_X\) and \(\mathbb{P}_M\) respectively. The different types of missing data refer to different dependence relationships between \(X_{\text{OBS}}\), \(X_{\text{MIS}}\) and \(M\).

The observations are said to be Missing Completely At Random (MCAR) if the probability that an observation is missing is independent of the variables and observations in the dataset: the probability that an observation is missing does not depend on \((X_{\text{OBS}},X_{\text{MIS}})\). Formally,

\[\mathbb{P}_M(M|X_{\text{OBS}},X_{\text{MIS}}) = \mathbb{P}_M(M).\]

The observations are said to be Missing At Random (MAR) if the probability that an observation is missing only depends on the observed data \(X_{\text{OBS}}\). Formally,

\[\mathbb{P}_M(M|X_{\text{OBS}},X_{\text{MIS}}) = \mathbb{P}_M(M|X_{\text{OBS}}).\]

The observations are said to be Missing Not At Random (MNAR) in all other cases.
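To fix ideas, here is a small illustrative sketch (not part of the original lecture) that generates the three mechanisms on synthetic data; the variable names, the logistic link and the missingness rates are arbitrary choices.

set.seed(1)
n  <- 1000
X1 <- rnorm(n)
X2 <- rnorm(n, mean = X1)               # complete data
m_mcar <- rbinom(n, 1, 0.3)             # MCAR: missingness independent of (X1, X2)
m_mar  <- rbinom(n, 1, plogis(X1))      # MAR: depends only on the fully observed X1
m_mnar <- rbinom(n, 1, plogis(X2))      # MNAR: depends on the value of X2 itself
X2_mcar <- ifelse(m_mcar == 1, NA, X2)
X2_mar  <- ifelse(m_mar  == 1, NA, X2)
X2_mnar <- ifelse(m_mnar == 1, NA, X2)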

11.1.1.2 Ignorable missing values

Many statistical methods are based on estimating a parameter by maximizing the likelihood of the data. Assume that \(X\) has a density, parametrized by some parameter \(\theta\) that we want to estimate - if \(X\) is Gaussian for instance we simply have \(\theta = (\mu, \Sigma)\). Assume that \(M\) also has a density parametrized by another parameter \(\phi\) - for example the probability \(p\) of a Bernoulli distribution. In some cases, estimating \(\theta\) from incomplete data can be done in a very simple way by ignoring, or “skipping”, the missing data, as detailed below.

We denote by \(f(X,M|\theta, \phi)\) the joint density of the observed and missing entries and of the indicator of missingness, conditioned on parameters \(\theta\) and \(\phi\). In the context of maximum likelihood estimation, we maximize with respect to \(\theta\) the marginal density of the observed data \(X_{\text{OBS}}\) and of the missingness indicator \(M\): \[f(X_{\text{OBS}},M|\theta, \phi) = \int f(X_{\text{OBS}}, X_{\text{MIS}},M|\theta, \phi)dX_{\text{MIS}}.\] If the data are MAR (or MCAR), the following factorization holds \[f(X_{\text{OBS}}, X_{\text{MIS}},M|\theta, \phi) = f(X_{\text{OBS}}, X_{\text{MIS}}|\theta) f(M|X_{\text{OBS}}, \phi).\] Plugging this in the expression of the marginal density we obtain \[f(X_{\text{OBS}},M|\theta, \phi) = \int f(X_{\text{OBS}}, X_{\text{MIS}}|\theta) f(M|X_{\text{OBS}}, \phi)dX_{\text{MIS}},\]

\[f(X_{\text{OBS}},M|\theta, \phi) = f(M|X_{\text{OBS}}, \phi)\int f(X_{\text{OBS}}, X_{\text{MIS}}|\theta) dX_{\text{MIS}},\] \[f(X_{\text{OBS}},M|\theta, \phi) = f(M|X_{\text{OBS}}, \phi) f(X_{\text{OBS}}|\theta) .\] If \(\phi\) and \(\theta\) are distinct (the joint parameter space of \((\theta, \phi)\) is the product of the parameter space of \(\theta\) and the parameter space of \(\phi\)), then, since the term \(f(M|X_{\text{OBS}}, \phi)\) is constant with respect to \(\theta\), it is equivalent to maximize the likelihood \(f(X_{\text{OBS}}|\theta)\), i.e. to ignore the missing data mechanism. This means that, when doing inference, i.e. when computing the ML estimates of the parameters from an incomplete dataset, one can “simply” maximize the observed likelihood while ignoring the process that generated the missing values. Consequently, most of the methods used in practice rely on the assumption that the data are MAR.

11.1.2 Expectation Maximization algorithm

In the case where we are interested in estimating some unknown parameter \(\theta\in\mathbb{R}^d\) characterizing the model (such as \(\mu\) and \(\Sigma\) in the Gaussian example), the Expectation Maximization (EM) algorithm (Dempster et al. 1977) can be used when the joint distribution of the missing data \(X_{\text{MIS}}\) and the observed data \(X_{\text{OBS}}\) is explicit. For all \(\theta\in\mathbb{R}^d\), let \(f_{\theta}\) be the probability density function of \((X_{\text{OBS}},X_{\text{MIS}})\) with respect to a given reference measure \(\lambda\). The EM algorithm aims at iteratively maximizing the likelihood of the observations, i.e. the probability density function of the observations, where \(y\) refers to \(X_{\text{OBS}}\) and \(x\) to \(X_{\text{MIS}}\): \[ L_{\theta}(y) = \int f_{\theta}(x,y)\lambda(\mathrm{d}x)\,. \] As this quantity cannot be computed explicitly in general, the EM algorithm relies on the surrogate intermediate quantity: \[ Q(\theta,\theta') =\mathbb{E}_{\theta'}[\log f_{\theta}(X_{\text{OBS}},X_{\text{MIS}})|X_{\text{OBS}}]\,, \] where \(\mathbb{E}_{\theta'}\) is the expectation under the law of the model parameterized by \(\theta'\). The following crucial property motivates the EM algorithm: for all \(\theta,\theta'\), \[ \log L_{\theta}(y) - \log L_{\theta'}(y) \ge Q(\theta,\theta')-Q(\theta',\theta')\,. \] Therefore, any value \(\theta\) such that \(Q(\theta,\theta')\) is greater than the reference value \(Q(\theta',\theta')\) increases the log-likelihood of the observations. Based on this inequality, the EM algorithm iteratively produces a sequence of parameter estimates \((\theta_p)_{p\ge 0}\). Each iteration is decomposed into two steps: \[ \mbox{E-step: compute} \quad \theta \mapsto Q(\theta,\theta_p)\,,\\ \mbox{M-step: set} \quad \theta_{p+1}\in \mbox{Argmax}_\theta Q(\theta,\theta_p)\,. \] The practical interest of this algorithm can be assessed only in cases where \(Q(\theta,\theta_p)\) can be computed (or estimated) at a reasonable computational cost (see for instance the special case where \(f_{\theta}\) belongs to the exponential family) and when \(\theta \mapsto Q(\theta,\theta_p)\) can be maximized (at least numerically).

11.1.3 Conditional distributions in the Gaussian case

Assume first that the complete data \(X=(X_{\text{OBS}},X_{\text{MIS}})\) has a multivariate normal distribution \(\mathcal{N}(\mu,\Sigma)\). The parameters \(\mu\) and \(\Sigma\) may be estimated using maximum likelihood based procedures for incomplete data models such as the Expectation Maximization algorithm detailed above. Then, the conditional distribution of the missing data \(X_{\text{MIS}}\) given the observations \(X_{\text{OBS}}\) can be derived using Schur complements. If \(\Sigma_{\text{MIS}}\in\mathbb{R}^{m\times m}\) is the covariance of \(X_{\text{MIS}}\), \(\Sigma_{\text{OBS}}\in\mathbb{R}^{p\times p}\) is the covariance of \(X_{\text{OBS}}\) and \(C_{\text{MIS},\text{OBS}}\in\mathbb{R}^{m\times p}\) is the covariance matrix between \(X_{\text{MIS}}\) and \(X_{\text{OBS}}\) then \(\Sigma\) is given by: \[ \Sigma = \begin{pmatrix} \Sigma_{\text{MIS}}&C_{\text{MIS},\text{OBS}}\\C'_{\text{MIS},\text{OBS}}&\Sigma_{\text{OBS}} \end{pmatrix}\,. \] Conditionally on \(X_{\text{OBS}}\), \(X_{\text{MIS}}\) has a normal distribution with covariance matrix \(\Sigma_{\text{MIS}|\text{OBS}}\) given by the Schur complement of \(\Sigma_{\text{OBS}}\) in \(\Sigma\): \[ \Sigma_{\text{MIS}|\text{OBS}} = \Sigma_{\text{MIS}} - C_{\text{MIS},\text{OBS}}\Sigma^{-1}_{\text{OBS}}C'_{\text{MIS},\text{OBS}}\,. \]

Note also that the mean \(\mu_{\text{MIS}|\text{OBS}}\) of the distribution of \(X_{\text{MIS}}\) given \(X_{\text{OBS}}\) is: \[ \mu_{\text{MIS}|\text{OBS}} = \mathbb{E}[X_{\text{MIS}}] + C_{\text{MIS},\text{OBS}}\Sigma_{\text{OBS}}^{-1}\left(X_{\text{OBS}} - \mathbb{E}[X_{\text{OBS}}]\right)\,. \]
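As a minimal illustration of these formulae, outside any dedicated package, the conditional mean and covariance can be computed directly; the function name cond_gaussian and the way the indices are passed are arbitrary choices.

cond_gaussian <- function(mu, Sigma, obs_idx, x_obs) {
  # conditional distribution of the missing coordinates given the observed ones
  mis_idx <- setdiff(seq_along(mu), obs_idx)
  S_oo <- Sigma[obs_idx, obs_idx, drop = FALSE]
  S_mo <- Sigma[mis_idx, obs_idx, drop = FALSE]        # C_MIS,OBS
  S_mm <- Sigma[mis_idx, mis_idx, drop = FALSE]
  mu_cond    <- mu[mis_idx] + S_mo %*% solve(S_oo, x_obs - mu[obs_idx])
  Sigma_cond <- S_mm - S_mo %*% solve(S_oo, t(S_mo))   # Schur complement
  list(mean = drop(mu_cond), cov = Sigma_cond)
}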

In R, the norm package can be used to estimate the mean and covariance matrix by EM and then impute the missing values using the previous formulae:

library(norm)
pre <- prelim.norm(as.matrix(don))    # preliminary manipulations on the incomplete data matrix
thetahat <- em.norm(pre)              # EM estimation of mu and Sigma
getparam.norm(pre, thetahat)          # display the estimated parameters
rngseed(1234567)                      # a seed must be set before imputation
imp <- imp.norm(pre, thetahat, don)   # impute the missing entries from their conditional distribution

11.1.4 Bootstrap

The bootstrap is another way to estimate quantities characterizing the distribution of an estimator, such as its standard error, or to build confidence intervals. It can be used when the unknown quantity to be estimated can be written as a functional of the unknown distribution function \(f\) of interest. For instance, in the case of incomplete data models, the bootstrap method is a solution to estimate any quantity which can be expressed as a functional of the unknown conditional distribution \(\pi\) of the latent data \(X_{\text{MIS}}\) given the observations.

Assume that \((X_i)_{1\le i \le n}\) are i.i.d. with common unknown distribution function \(f\) and let \(\theta\in\mathbb{R}^d\) be any parameter characterizing \(f\). The parameter \(\theta\) is estimated by \(\hat{\theta}(X_1,\ldots,X_n)\). Then, the variance of the estimator is given by: \[ \mathbb{V}_f[\hat{\theta}(X_1,\ldots,X_n)] = \mathbb{E}_f\left[\left(\hat{\theta}(X_1,\ldots,X_n)-\mathbb{E}_f[\hat{\theta}(X_1,\ldots,X_n)]\right)^2\right]\,, \] where \(\mathbb{E}_f\) is the expectation under the law of \((X_1,\ldots,X_n)\). The bootstrap estimator of \(\mathbb{V}_f[\hat{\theta}(X_1,\ldots,X_n)]\) is then obtained by replacing the unknown distribution function \(f\) in this expression by its empirical estimate given, for any \(x\), by: \[ f_n(x) = \frac{1}{n}\sum_{i=1}^n 1_{(-\infty,x]}(X_i)\,. \] For any integrable function \(h\), \[ \mathbb{E}_{f_n}[h(Z)] = \frac{1}{n}\sum_{i=1}^nh(X_i)\,. \] Replacing \(f\) by \(f_n\) can lead to highly involved estimates but in some common situations the bootstrap estimate of \(\mathbb{V}_f[\hat{\theta}(X_1,\ldots,X_n)]\) can be derived easily.

For instance, assume that \(\hat{\theta}(X_1,\ldots,X_n) = \bar{X}_n\) is the empirical estimate of the mean of \(f\). Then, \[ \mathbb{V}_f[\hat{\theta}(X_1,\ldots,X_n)] = \mathbb{V}_f[\bar{X}_n] = \frac{1}{n}\left(\mathbb{E}_f[X_1^2] - \mathbb{E}_f[X_1]^2\right)\,. \] Therefore, the bootstrap estimator of \(\mathbb{V}_f[\hat{\theta}(X_1,\ldots,X_n)]\) is given by \[ \mathbb{V}_{f_n}[\hat{\theta}(X_1,\ldots,X_n)]=\frac{1}{n}\left(\mathbb{E}_{f_n}[X_1^2] - \mathbb{E}_{f_n}[X_1]^2\right) = \frac{1}{n}\left(\frac{1}{n}\sum_{i=1}^nX_i^2 - \bar X_n^2\right)\,. \]
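In practice the bootstrap estimator is often approximated by Monte Carlo, resampling the data with replacement; the following sketch (with an arbitrary synthetic sample and B = 1000 replicates) compares this approximation with the closed-form expression above for the empirical mean.

set.seed(1)
x <- rnorm(100, mean = 2)                    # synthetic sample
B <- 1000                                    # number of bootstrap replicates
boot_means <- replicate(B, mean(sample(x, replace = TRUE)))
var(boot_means)                              # Monte Carlo approximation of the bootstrap variance
(mean(x^2) - mean(x)^2) / length(x)          # closed-form bootstrap estimate from the text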

11.1.5 Gibbs sampling

In the case where the complete data \((X_{\text{OBS}},X_{\text{MIS}})\) is not assumed to be a Gaussian vector, we may be interested in estimating or sampling from the (usually unknown) conditional distribution \(\pi\) of the missing data \(X_{\text{MIS}}\) given the observations \(X_{\text{OBS}}\). A widespread technique to do so is to use Markov Chain Monte Carlo (MCMC) methods, which provide simulation based procedures that have been successfully applied in many disciplines such as signal processing, biology, target tracking, etc.
One of the main objectives of these MCMC algorithms is to produce a Markov chain \((\xi^i)_{i\ge 0}\) targeting the unknown distribution \(\pi\). Using ergodic theory for Markov chains, it is expected for instance that \(N^{-1}\sum_{i=1}^N f(\xi^i)\) is a good estimate of \[ \int f(x)\pi(\mathrm{d}x) = \mathbb{E}[f(X_{\text{MIS}})|X_{\text{OBS}}] \] for a large class of functions \(f\).

Many MCMC algorithms have been developed to sample the chain \((\xi^i)_{i\ge 1}\); this section details the popular Gibbs sampler, which may be used when the conditional distribution of each latent variable given all the other variables has a simple form. Assume that the missing data may be decomposed into several components \((X_1,\ldots,X_m)\) and for all \(1\le k \le m\) let \(\pi_{-k}\) be the conditional distribution of \(X_k\) given the observations and the other missing data. Then, starting with any initial state \(\xi^1 = (\xi^1_1,\ldots,\xi^1_m)\), for all \(i\ge 1\), conditionally on \(\xi^i\), the Gibbs sampler samples \(\xi^{i+1}\) as follows. For all \(1\le k \le m\), \(\xi^{i+1}_k\) is sampled according to \(\pi_{-k}(\cdot |\xi^{i+1}_1,\ldots,\xi^{i+1}_{k-1},\xi^{i}_{k+1},\ldots,\xi^{i}_{m})\). All components of the new state \(\xi^{i+1}\) are sampled iteratively according to the conditional distribution given the observations and the other components. A nice feature of this conditional sampler is that at each iteration \(i\ge 1\), every component update \(1\le k \le m\) (which produces \(\xi^{i+1}_k\)) is reversible with respect to \(\pi\), which implies that the Markov kernel associated with each component update admits \(\pi\) as a stationary probability distribution. The convergence of the Gibbs sampler may be established for the complete update of all components at each iteration, and several procedures have been proposed to combine the individual moves (not necessarily with a systematic update of each component in a row).
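As an illustration in the simplest possible case (a standard bivariate Gaussian target with correlation rho, which is not the general missing-data setting above), a systematic-scan Gibbs sampler updates each component from its conditional distribution given the other; the number of iterations and the value of rho below are arbitrary.

set.seed(1)
rho <- 0.8; N <- 5000
xi <- matrix(NA, nrow = N, ncol = 2)
xi[1, ] <- c(0, 0)                                       # arbitrary initial state
for (i in 2:N) {
  # update component 1 given component 2, then component 2 given the new component 1
  xi[i, 1] <- rnorm(1, mean = rho * xi[i - 1, 2], sd = sqrt(1 - rho^2))
  xi[i, 2] <- rnorm(1, mean = rho * xi[i, 1], sd = sqrt(1 - rho^2))
}
colMeans(xi)   # ergodic averages approximate expectations under the target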

11.2 Lab

11.2.1 Lecture questions

  • When you suggest methods to deal with missing values to users, the recurrent question is “What is the percentage of missing values that I can have in my data set, is 50% too much but 20% OK?” What is your answer to this question?

  • Explain the aims of multiple imputation in comparison to single imputation.

11.2.2 Continuous data with missing values - Regression with missing data via Multiple Imputation

First of all you will need to install the following packages

install.packages("VIM")
install.packages("missMDA")
install.packages("Amelia")

Air pollution is currently one of the most serious public health concerns worldwide. Many epidemiological studies have shown the influence that some chemical compounds, such as sulphur dioxide (SO2), nitrogen dioxide (NO2) and ozone (O3), can have on our health. Associations set up to monitor air quality are active all over the world to measure the concentration of these pollutants. They also keep a record of meteorological conditions such as temperature, cloud cover, wind, etc.

We have at our disposal 112 observations collected during the summer of 2001 in Rennes. The variables available are

  • maxO3 (maximum daily ozone)
  • maxO3v (maximum daily ozone the previous day)
  • T12 (temperature at midday)
  • T9 (temperature at 9am)
  • T15 (temperature at 3pm)
  • Vx12 (projection of the wind speed vector on the east-west axis at midday)
  • Vx9 and Vx15 (the same projections at 9am and 3pm), as well as the nebulosity (cloud cover) variables Ne9, Ne12, Ne15

Here the final aim is to analyse the relationship between the maximum daily ozone level (maxO3) and the other meteorological variables. To do so we will perform a regression to explain maxO3 as a function of all the other variables. This data is incomplete (there are missing values). Indeed, it frequently happens that machines fail for a day, so that some information is not recorded. We will therefore perform regression via multiple imputation.

  • Import the data.
ozo <- read.table("data/ozoneNA.csv",header=TRUE,
sep=",", row.names=1)
WindDirection <- ozo[,12]
don <- ozo[,1:11]   #### keep the continuous variables
summary(don)
##      maxO3              T9             T12             T15       
##  Min.   : 42.00   Min.   :11.30   Min.   :14.30   Min.   :14.90  
##  1st Qu.: 71.00   1st Qu.:16.00   1st Qu.:18.60   1st Qu.:18.90  
##  Median : 81.50   Median :17.70   Median :20.40   Median :21.40  
##  Mean   : 91.24   Mean   :18.22   Mean   :21.46   Mean   :22.41  
##  3rd Qu.:108.25   3rd Qu.:19.90   3rd Qu.:23.60   3rd Qu.:25.65  
##  Max.   :166.00   Max.   :25.30   Max.   :33.50   Max.   :35.50  
##  NA's   :16       NA's   :37      NA's   :33      NA's   :37     
##       Ne9             Ne12            Ne15           Vx9         
##  Min.   :0.000   Min.   :0.000   Min.   :0.00   Min.   :-7.8785  
##  1st Qu.:3.000   1st Qu.:4.000   1st Qu.:3.00   1st Qu.:-3.0000  
##  Median :5.000   Median :5.000   Median :5.00   Median :-0.8671  
##  Mean   :4.987   Mean   :4.986   Mean   :4.60   Mean   :-1.0958  
##  3rd Qu.:7.000   3rd Qu.:7.000   3rd Qu.:6.25   3rd Qu.: 0.6919  
##  Max.   :8.000   Max.   :8.000   Max.   :8.00   Max.   : 5.1962  
##  NA's   :34      NA's   :42      NA's   :32     NA's   :18       
##       Vx12              Vx15            maxO3v      
##  Min.   :-7.8785   Min.   :-9.000   Min.   : 42.00  
##  1st Qu.:-3.6941   1st Qu.:-3.759   1st Qu.: 70.00  
##  Median :-1.9284   Median :-1.710   Median : 82.50  
##  Mean   :-1.6853   Mean   :-1.830   Mean   : 89.39  
##  3rd Qu.:-0.1302   3rd Qu.: 0.000   3rd Qu.:101.00  
##  Max.   : 6.5778   Max.   : 3.830   Max.   :166.00  
##  NA's   :10        NA's   :21       NA's   :12
head(don)
##          maxO3   T9  T12  T15 Ne9 Ne12 Ne15     Vx9    Vx12    Vx15 maxO3v
## 20010601    87 15.6 18.5   NA   4    4    8  0.6946 -1.7101 -0.6946     84
## 20010602    82   NA   NA   NA   5    5    7 -4.3301 -4.0000 -3.0000     87
## 20010603    92 15.3 17.6 19.5   2   NA   NA  2.9544      NA  0.5209     82
## 20010604   114 16.2 19.7   NA   1    1    0      NA  0.3473 -0.1736     92
## 20010605    94   NA 20.5 20.4  NA   NA   NA -0.5000 -2.9544 -4.3301    114
## 20010606    80 17.7 19.8 18.3   6   NA    7 -5.6382 -5.0000 -6.0000     94
dim(don)
## [1] 112  11
  • Load the libraries.
library(VIM)
library(FactoMineR)

First, we perform some descriptive statistics (how many missing values? how many variables and individuals with missing values?) and try to inspect and visualize the pattern of missing entries and get hints on the mechanism that generated the missingness. For this purpose, we use the R package VIM (Visualization and Imputation of Missing Values - Mathias Templ) as well as Multiple Correspondence Analysis (FactoMineR package). The package VIM provides tools for the visualization of missing or imputed values, which can be used for exploring the data and the structure of the missing or imputed values. Depending on this structure, they may help to identify the mechanism generating the missing values or errors that may have happened in the imputation process. You should install the package VIM, then you can check the documentation by executing

?VIM

The VIM function aggr calculates and plots the amount of missing entries in each variable and in combinations of variables (which tend to be missing simultaneously).

dim(na.omit(don))
res<-summary(aggr(don, sortVar=TRUE))$combinations
head(res[rev(order(res[,2])),])
aggr(don, sortVar=TRUE)

The VIM function matrixplot creates a matrix plot in which all cells of a data matrix are visualized by rectangles. Available data is coded according to a continuous color scheme (gray scale), while missing/imputed data is visualized by a clearly distinguishable color (red). If you use RStudio the plot is not interactive (thus the warnings), but if you use R directly, you can click on a column of your choice; this sorts the rows in decreasing order of the values of this column. This is useful to check if there is an association between the value of a variable and the missingness of another one.

matrixplot(don, sortby=2) # interactive sorting by clicking does not work in RStudio

The VIM function marginplot creates a scatterplot with additional information on the missing values. If you plot the variables (x,y), the points with no missing values are represented as in a standard scatterplot. The points for which x (resp. y) is missing are represented in red along the y (resp. x) axis. In addition, boxplots of the x and y variables are represented along the axes with and without missing values (in red, the boxplot of the x values for which y is missing; in blue, the boxplot of the x values for which y is observed).

marginplot(don[,c("T9","maxO3")])
  • Do you observe any associations between the missing entries? When values are missing on a variable, does it correspond to small or large values on another one? (For this question you need to use the matrixplot function in R)

  • Create a categorical dataset with “o” when the value of the cell is observed and “m” when it is missing, and with the same row and column names as in the original data. Then, you can perform Multiple Correspondence Analysis to visualize the association with the MCA function.

?MCA
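A possible way to build the pattern data set and run MCA on it (the object names pattern and res.mca are arbitrary):

pattern <- as.data.frame(ifelse(is.na(don), "m", "o"))   # "o" = observed, "m" = missing
pattern[] <- lapply(pattern, as.factor)                  # MCA expects categorical variables
rownames(pattern) <- rownames(don)
res.mca <- MCA(pattern)                                  # FactoMineR is already loaded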

Then, before modeling the data, we perform a PCA with missing values to explore the correlation between variables. We use the R package missMDA, dedicated to principal component methods with missing values and to imputing data with PC methods.

  • Determine the number of components ncp to keep using the estim_ncpPCA function. Perform PCA with missing values using the imputePCA function and ncp components. Then plot the correlation circle (a possible call sequence is sketched after this list).
?estim_ncpPCA
?imputePCA
  • Could you guess how cross-validation is performed to select the number of components?
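A possible call sequence; the range of 0 to 5 components tested by cross-validation is an arbitrary choice.

library(missMDA)
nb <- estim_ncpPCA(don, ncp.min = 0, ncp.max = 5)   # cross-validation over the number of components
nb$ncp                                              # selected number of components
res.comp <- imputePCA(don, ncp = nb$ncp)            # (single) imputation by iterative PCA
res.pca <- PCA(res.comp$completeObs)                # PCA on the completed data, with the correlation circle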

Then, to run the regression with missing values, we use Multiple Imputation. We impute the data either assuming 1) a Gaussian distribution (library Amelia) or 2) a PCA based model (library missMDA). Note that there are two ways to impute: either using a Joint Modeling approach (one joint probabilistic model for all the variables together) or a Conditional Modeling approach (one model per variable). We refer to the references given in the slides for more details. We use the R package Amelia. We generate 100 imputed data sets with the amelia method:

library(Amelia)
## Loading required package: Rcpp
## ## 
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.4, built: 2015-12-05)
## ## Copyright (C) 2005-2017 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
?amelia
res.amelia <- amelia(don, m=100)  
#names(res.amelia$imputations) 
res.amelia$imputations$imp1  # the first imputed data set
  • Now generate 100 imputed data sets with the MIPCA method and 2 components. Store the result in a variable called res.MIPCA.
?MIPCA
?plot.MIPCA
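A possible call, assuming 2 components as stated above and 100 bootstrap replicates as with amelia; the argument names are those of the missMDA MIPCA function.

res.MIPCA <- MIPCA(don, ncp = 2, nboot = 100)   # generates 100 imputed data sets with the PCA model
names(res.MIPCA)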

We will inspect the imputed values to decide whether the imputation method requires further investigation or whether we can continue and analyze the data. A common practice consists in comparing the distribution of the imputed values with that of the observed values. Check the compare.density function and apply it to compare the distributions of the T12 variable.

?compare.density
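For instance, for the T12 variable, a possible call is:

compare.density(res.amelia, var = "T12")   # densities of observed versus imputed T12 values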
  • Do both distributions need to be close? Could the missing values differ from the observed ones both in spread and in location?

The quality of imputation can also be assessed with cross-validation using the overimpute function. Each observed value is deleted and for each one 100 values are predicted (using the same MI method) and the mean and 90% confidence intervals are computed for these 100 values. Then, we inspect whether the observed value falls within the obtained interval. On the graph, the y=x line is plotted (where the imputations should fall if they were perfect), as well as the mean (dots) and intervals (lines) for each value. Around ninety percent of these confidence intervals should contain the y = x line, which means that the true observed value falls within this range. The color of the line (as coded in the legend) represents the fraction of missing observations in the pattern of missingness for that observation (ex: blue=0-2 missing entries).

?overimpute
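For instance, on the response variable, a possible call is:

overimpute(res.amelia, var = "maxO3")   # overimputation diagnostic for maxO3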
  • Comment the quality of the imputation.

We can also examine the variability by projecting the imputed data sets as supplementary tables onto the PCA configuration (plot the results of MI with PCA).

plot(res.MIPCA,choice= "ind.supp")
plot(res.MIPCA,choice= "var")
  • Apply a regression model on each imputed data set of the amelia method. Hint: a regression with several variables can be performed as follows: lm(formula = "maxO3 ~ T9+T12", data = don). You can also use the function with.
resamelia <- lapply(res.amelia$imputations, as.data.frame)
# A regression on each imputed dataset
fitamelia<-lapply(resamelia, lm, 
                  formula="maxO3~ T9+T12+T15+Ne9+Ne12+Ne15+Vx9+Vx12+Vx15+maxO3v")  
# fitamelia <- lapply(resamelia, with, 
#                     lm(maxO3 ~ T9+T12+T15+Ne9+Ne12+Ne15+Vx9+Vx12+Vx15+maxO3v))
  • Now do the same with the imputed datasets of the MIPCA method.
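A possible solution, assuming (as in missMDA) that the imputed data sets are stored in the res.MI element of the MIPCA output:

fitMIPCA <- lapply(res.MIPCA$res.MI, lm,
                   formula="maxO3~ T9+T12+T15+Ne9+Ne12+Ne15+Vx9+Vx12+Vx15+maxO3v")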

The package mice (Multiple Imputation by Chained Equations) can be used to aggregate the results of the models fitted on the different imputed data sets.

library(mice)
## Loading required package: lattice
# ?mice
# pool is a function from mice to aggregate the results according to Rubin's rule
# ?pool
  • Aggregate the results of Regression with Multiple Imputation according to Rubin’s rule (slide “Multiple imputation”) for MI with amelia with the pool function from the mice package.
poolamelia<-pool(as.mira(fitamelia)) 
summary(poolamelia)
  • Now do the same with the MIPCA results.
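Analogously to the amelia case, a possible solution is:

poolMIPCA <- pool(as.mira(fitMIPCA))
summary(poolMIPCA)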

  • Write a function that removes the variables with the largest pvalues step by step (each time a variable is removed the regression model is performed again) until all variables are significant.
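One possible sketch is given below. The helper name step_selection, the 0.05 significance threshold and the reliance on the term and p.value columns of summary(pool(...)) (whose exact names may differ across mice versions) are assumptions, not part of the original exercise.

step_selection <- function(imputed_list, response = "maxO3", alpha = 0.05) {
  vars <- setdiff(colnames(imputed_list[[1]]), response)
  repeat {
    form <- as.formula(paste(response, "~", paste(vars, collapse = "+")))
    fits <- lapply(imputed_list, function(d) lm(form, data = d))
    pooled <- summary(pool(as.mira(fits)))
    pv <- setNames(pooled$p.value, as.character(pooled$term))
    pv <- pv[names(pv) != "(Intercept)"]
    if (max(pv) < alpha || length(vars) == 1) break
    vars <- setdiff(vars, names(which.max(pv)))   # remove the least significant variable
  }
  list(selected = vars, pooled = pooled)
}
# e.g. step_selection(resamelia) or step_selection(res.MIPCA$res.MI)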

11.2.3 EM algorithm for bivariate normal data with missing values

Authors: Julie Josse, Nicolas Brosse, Geneviève Robin. Exam 2016.

The purpose of this problem is to use the EM algorithm to estimate the mean of a bivariate normal dataset with missing entries in one of the two variables. We first generate synthetic data and then implement the EM algorithm to compute the estimator of the mean.

library(mvtnorm)

We consider a bivariate normal random variable \(y=\begin{pmatrix} y_1\\ y_2 \end{pmatrix}\) and denote the mean vector and covariance matrix of its distribution \(\mu=\begin{pmatrix} \mu_1\\ \mu_2 \end{pmatrix}\) and \(\Sigma=\begin{pmatrix} \sigma_{11} & \sigma_{12}\\ \sigma_{12} & \sigma_{22} \end{pmatrix}\): \(y\sim \mathcal{N}(\mu, \Sigma)\). We observe a sample of size \(n\) that contains some missing values in the variable \(y_2\), such that for some \(r\leq n\), we observe \((y_{i1},y_{i2})\) for \(i=1,\ldots,r\) and only \(y_{i1}\) for \(i=r+1,\ldots,n\). The goal is to estimate the mean \(\mu\). We will compare two strategies: 1) direct computation of the maximum likelihood estimator and 2) estimation of the mean with the EM algorithm.

11.2.3.1 Data generation

(R1) Generate a bivariate normal sample of size \(100\) of mean \(\begin{pmatrix} \mu_1\\ \mu_2 \end{pmatrix}=\begin{pmatrix} 5\\ -1 \end{pmatrix}\) and covariance matrix \(\begin{pmatrix} \sigma_{11} & \sigma_{12}\\ \sigma_{12} & \sigma_{22} \end{pmatrix}=\begin{pmatrix} 1.3 & 0.4\\ 0.4 & 0.9 \end{pmatrix}\) containing 30% of missing values in the variable \(y_2\).
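A possible solution; the seed and the choice of deleting 30 values of \(y_2\) completely at random are arbitrary.

set.seed(1)
n <- 100
mu <- c(5, -1)
Sigma <- matrix(c(1.3, 0.4, 0.4, 0.9), nrow = 2)
y <- rmvnorm(n, mean = mu, sigma = Sigma)     # mvtnorm is loaded above
miss_idx <- sample(n, size = 0.3 * n)         # indices of the 30% missing y2 values
y[miss_idx, 2] <- NA
r <- sum(!is.na(y[, 2]))                      # number of complete pairs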

11.2.3.2 Maximum likelihood estimator

We denote by \(f_{1,2}(y_1,y_2;\mu, \Sigma)\), \(f_1(y_1;\mu_1, \sigma_{11})\) and \(f_{2|1}(y_2|y_1; \mu, \Sigma)\) the probability density functions of the joint \((y_1,y_2)\), of \(y_1\) and of \(y_2|y_1\) respectively. The likelihood of the observed data can be written as \[L(\mu, \Sigma)=\prod_{i=1}^nf_1(y_{i1})\prod_{i=1}^rf_{2|1}(y_{i2}|y_{i1}),\] and the log-likelihood is written (up to an additive constant that does not appear in the maximization and that we therefore drop)

\[l(\mu, \Sigma|y_1,y_2)=-\frac{n}{2}\log(\sigma_{11})-\frac{1}{2}\sum_{i=1}^n\frac{(y_{i1}-\mu_1)^2}{\sigma_{11}}-\frac{r}{2}\log\left(\sigma_{22}-\frac{\sigma_{12}^2}{\sigma_{11}}\right)\] \[-\frac{1}{2}\sum_{i=1}^r\frac{\left(y_{i2}-\mu_2-\frac{\sigma_{12}}{\sigma_{11}}(y_{i1}-\mu_1)\right)^2}{\sigma_{22}-\frac{\sigma_{12}^2}{\sigma_{11}}}\]

We skip the computations and directly give the expression of the closed form maximum likelihood estimator of the mean: \[ \hat{\mu}_1=n^{-1}\sum_{i=1}^ny_{i1} \] \[ \hat{\mu}_2=\hat{\beta}_{20.1}+\hat{\beta}_{21.1}\hat{\mu}_1,\]

where \(\hat{\beta}_{21.1}=s_{12}/s_{11}\), \(\hat{\beta}_{20.1}=\bar{y}_2-\hat{\beta}_{21.1}\bar{y}_1\), with \(\bar{y}_j=r^{-1}\sum_{i=1}^ry_{ij}\), \(j=1,2\), and \(s_{jk}=r^{-1}\sum_{i=1}^r(y_{ij}-\bar{y}_j)(y_{ik}-\bar{y}_k)\), \(j,k=1,2\).

(R2) Compute the maximum likelihood estimates of \(\mu_1\) and \(\mu_2\).
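Following the formulas above, a possible solution (reusing the objects y and r created at the data-generation step):

obs2 <- !is.na(y[, 2])                         # rows where y2 is observed
mu1hat <- mean(y[, 1])                         # based on all n observations of y1
y1bar <- mean(y[obs2, 1]); y2bar <- mean(y[obs2, 2])
s11 <- mean((y[obs2, 1] - y1bar)^2)
s12 <- mean((y[obs2, 1] - y1bar) * (y[obs2, 2] - y2bar))
beta21 <- s12 / s11
beta20 <- y2bar - beta21 * y1bar
mu2hat <- beta20 + beta21 * mu1hat
c(mu1hat, mu2hat)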

11.2.3.3 EM algorithm

In this simple setting, we have an explicit expression of the maximum likelihood estimator despite the missing values. This is not always the case; the EM algorithm provides a general way to obtain the maximum likelihood estimates when data are missing.

The EM algorithm maximizes the “observed log-likelihood” \[l(\mu, \Sigma|y_1,y_2)=-\frac{n}{2}\log(\sigma_{11})-\frac{1}{2}\sum_{i=1}^n\frac{(y_{i1}-\mu_1)^2}{\sigma_{11}}-\frac{r}{2}\log\left(\sigma_{22}-\frac{\sigma_{12}^2}{\sigma_{11}}\right)\] \[-\frac{1}{2}\sum_{i=1}^r\frac{\left(y_{i2}-\mu_2-\frac{\sigma_{12}}{\sigma_{11}}(y_{i1}-\mu_1)\right)^2}{\sigma_{22}-\frac{\sigma_{12}^2}{\sigma_{11}}},\] through successive maximizations of the “complete log-likelihood”, i.e. the one we would have if all \(n\) realizations of \(y_1\) and \(y_2\) were observed. Maximizing the complete log-likelihood \[l_c(\mu, \Sigma|y_1,y_2)=-\frac{n}{2}\log(\det(\Sigma))-\frac{1}{2}\sum_{i=1}^n(y_{i}-\mu)^T\Sigma^{-1}(y_{i}-\mu),\quad \text{where } y_i=(y_{i1},y_{i2})^T,\]

would be straightforward if we had all the observations. However, some elements of this likelihood are not available. Consequently, we replace them by their conditional expectation given the observed data and the parameters of the current iteration. These two steps, computation of the conditional expectation (E-step) and maximization of the completed likelihood (M-step), are repeated until convergence.

The update formulas for the E and M steps are the following

E step:

The sufficient statistics of the likelihood are:

\[s_1=\sum_{i=1}^ny_{i1},\quad s_2=\sum_{i=1}^ny_{i2},\quad s_{11}=\sum_{i=1}^ny_{i1}^2,\quad s_{22}=\sum_{i=1}^ny_{i2}^2,\quad s_{12}=\sum_{i=1}^ny_{i1}y_{i2}.\]

Since some values of \(y_2\) are not available, we fill in the sufficient statistics, for \(i=r+1,\ldots,n\), with:

\[E[y_{i2}|y_{i1},\mu,\Sigma]=\beta_{20.1}+\beta_{21.1}y_{i1}\] \[E[y_{i2}^2|y_{i1},\mu,\Sigma]=(\beta_{20.1}+\beta_{21.1}y_{i1})^2+\sigma_{22.1}\] \[E[y_{i1}y_{i2}|y_{i1},\mu,\Sigma]=(\beta_{20.1}+\beta_{21.1}y_{i1})y_{i1},\] where \(\beta_{21.1}=\sigma_{12}/\sigma_{11}\), \(\beta_{20.1}=\mu_2-\beta_{21.1}\mu_1\) and \(\sigma_{22.1}=\sigma_{22}-\sigma_{12}^2/\sigma_{11}\).

M step:

The M step consists in computing the maximum likelihood estimates as usual. Given \(s_1, s_2, s_{11}, s_{22}, \text{and } s_{12},\) update \(\hat{\mu}\) and \(\hat{\Sigma}\) with \[\hat{\mu}_1=s_1/n\text{, }\hat{\mu}_2=s_2/n,\] \[\hat{\sigma}_{11}=s_{11}/n-\hat{\mu}_1^2\text{, }\hat{\sigma}_{22}=s_{22}/n-\hat{\mu}_2^2\text{, }\hat{\sigma}_{12}=s_{12}/n-\hat{\mu}_1\hat{\mu}_2\]

Note that \(s_1\), \(s_{11}\), \(\hat{\mu}_1\) and \(\hat{\sigma}_{11}\) are constant across iterations since we do not have missing values on \(y_1\).

(R3) Write two functions called Estep and Mstep that respectively perform the E step and the M step. The Estep function can take as input \(\mu\) and \(\Sigma\). Then, you can compute \(\beta_{21.1}=\sigma_{12}/\sigma_{11}\), \(\beta_{20.1}=\mu_2-\beta_{21.1}\mu_1\), and \(\sigma_{22.1}=\sigma_{22}-\sigma^2_{12}/\sigma_{11}\) and update the sufficient statistics \(s_{ij}\). The Mstep function consists in updating \(\mu\) and \(\Sigma\) given the \(s_{ij}\).
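A possible implementation; the argument list (the data matrix y with NA's in the second column) and the list-based return values are one choice among many.

Estep <- function(y, mu, Sigma) {
  # fill in the sufficient statistics using the current values of mu and Sigma
  beta21 <- Sigma[1, 2] / Sigma[1, 1]
  beta20 <- mu[2] - beta21 * mu[1]
  sigma22.1 <- Sigma[2, 2] - Sigma[1, 2]^2 / Sigma[1, 1]
  y2  <- ifelse(is.na(y[, 2]), beta20 + beta21 * y[, 1], y[, 2])
  y22 <- ifelse(is.na(y[, 2]), (beta20 + beta21 * y[, 1])^2 + sigma22.1, y[, 2]^2)
  y12 <- ifelse(is.na(y[, 2]), (beta20 + beta21 * y[, 1]) * y[, 1], y[, 1] * y[, 2])
  list(s1 = sum(y[, 1]), s2 = sum(y2), s11 = sum(y[, 1]^2), s22 = sum(y22), s12 = sum(y12))
}
Mstep <- function(s, n) {
  # maximum likelihood estimates given the (completed) sufficient statistics
  mu <- c(s$s1, s$s2) / n
  sigma11 <- s$s11 / n - mu[1]^2
  sigma22 <- s$s22 / n - mu[2]^2
  sigma12 <- s$s12 / n - mu[1] * mu[2]
  list(mu = mu, Sigma = matrix(c(sigma11, sigma12, sigma12, sigma22), nrow = 2))
}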

(Q1) How could we initialize the algorithm?

(R4) Implement a function called initEM that returns initial values for \(\hat{\mu}\) and \(\hat{\Sigma}\).
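One simple possibility (among others) is to initialize with the complete-case estimates:

initEM <- function(y) {
  cc <- y[!is.na(y[, 2]), ]                                              # complete cases only
  list(mu = colMeans(cc), Sigma = cov(cc) * (nrow(cc) - 1) / nrow(cc))   # ML version of the covariance
}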

(R5) Implement the EM algorithm over 15 iterations and plot the value of \(\left\|\mu-\hat{\mu}\right\|^2\) over iterations. Comment your results briefly.
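Putting the pieces together, a possible sketch with 15 iterations; mu_true denotes the mean used to simulate the data in (R1).

mu_true <- c(5, -1)
n <- nrow(y)
init <- initEM(y)
mu.hat <- init$mu; Sigma.hat <- init$Sigma
err <- numeric(15)
for (it in 1:15) {
  s <- Estep(y, mu.hat, Sigma.hat)     # E step: completed sufficient statistics
  m <- Mstep(s, n)                     # M step: updated parameter estimates
  mu.hat <- m$mu; Sigma.hat <- m$Sigma
  err[it] <- sum((mu_true - mu.hat)^2)
}
plot(err, type = "b", xlab = "iteration", ylab = "squared error ||mu - mu.hat||^2")
mu.hat   # to be compared with the maximum likelihood estimates of (R2) in (R6)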

(R6) Check that the EM estimator \(\hat{\mu}\) is equal to the maximum likelihood estimator.