Missing Data Imputation etc: Literature and R packages

Introduction / Motivation:

  • Netflix Challenge –> Imputation becomes mainstream “fashion”
  • Imputation in computer science is aka “Matrix completion” or “Data completion”



(We)blogs on Missing Data Literature etc:

R packages


impute : Hastie et al: impute.knn()

Based on missing.pdf paper, Hastie et al. (1999).
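The idea behind k-nearest-neighbour imputation can be illustrated in a few lines of base R. This is only a minimal sketch under simplifying assumptions (numeric matrix, rows as cases, unscaled squared distance over the jointly observed columns); knn_impute_sketch() is a hypothetical helper and not the code used by impute or VIM.

```r
# Hypothetical helper for illustration only; not the impute or VIM implementation.
knn_impute_sketch <- function(x, k = 3) {
  stopifnot(is.matrix(x), is.numeric(x))
  out <- x
  for (i in which(rowSums(is.na(x)) > 0)) {
    miss <- is.na(x[i, ])
    # Candidate donors: rows that observe every column missing in row i
    cand <- which(rowSums(is.na(x[, miss, drop = FALSE])) == 0)
    if (length(cand) == 0) next
    # Mean squared difference over the columns observed in both rows
    d <- vapply(cand, function(j) {
      shared <- !miss & !is.na(x[j, ])
      if (!any(shared)) return(Inf)
      mean((x[i, shared] - x[j, shared])^2)
    }, numeric(1))
    nn <- cand[order(d)][seq_len(min(k, length(cand)))]
    # Impute with the average of the k nearest donors
    out[i, miss] <- colMeans(x[nn, miss, drop = FALSE])
  }
  out
}

x <- rbind(c(1.0, 2.0, 3.0),
           c(1.1, 2.1, NA),
           c(0.9, 1.9, 2.9),
           c(10, 20, 30))
x_imp <- knn_impute_sketch(x, k = 2)
x_imp
```

In practice the columns would usually be standardized first, since the unscaled distance is dominated by variables with large ranges.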


CRAN task view ‘Multivariate’

has section Missing data (not quite comprehensive, annotated by MM):

  • mitools provides tools for multiple imputation, by Thomas Lumley (R core, also author of survey).

  • mice provides Multivariate Imputation by Chained Equations, by Stef van Buuren; it is also the basis of his book.

  • VIM provides methods for the Visualisation as well as Imputation of Missing data.
    • basic interactive visualization: matrixplot(), scattmatrixMiss()
    • imputation: e.g. kNN()
    • Reference
## To cite package 'VIM' in publications use:
##   Matthias Templ, Andreas Alfons, Alexander Kowarik and Bernd
##   Prantner (2015). VIM: Visualization and Imputation of Missing
##   Values. R package version 4.4.1.
##   http://CRAN.R-project.org/package=VIM
## A BibTeX entry for LaTeX users is
##   @Manual{,
##     title = {VIM: Visualization and Imputation of Missing Values},
##     author = {Matthias Templ and Andreas Alfons and Alexander Kowarik and Bernd Prantner},
##     year = {2015},
##     note = {R package version 4.4.1},
##     url = {http://CRAN.R-project.org/package=VIM},
##   }
  • mvnmle provides ML estimation for multivariate normal data with missing values.
  • mix provides multiple imputation for mixed categorical and continuous data.
  • pan provides multiple imputation for missing panel data.
  • aregImpute() and transcan() from Hmisc provide further imputation methods.
  • monomvn deals with estimation models where the missing data pattern is monotone.
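Several of the packages above produce or combine multiply imputed data sets. The combining step is usually done with Rubin's rules, which can be sketched in base R as follows. pool_rubin() is a hypothetical helper for a single scalar parameter, not a function from mitools or mice, and it omits the degrees-of-freedom calculation those packages also report.

```r
# Hypothetical helper: Rubin's rules for one scalar estimate.
# est: point estimates from the m completed data sets
# se:  their standard errors
pool_rubin <- function(est, se) {
  m <- length(est)
  stopifnot(m >= 2, length(se) == m)
  qbar  <- mean(est)               # pooled point estimate
  w     <- mean(se^2)              # within-imputation variance
  b     <- var(est)                # between-imputation variance
  total <- w + (1 + 1 / m) * b     # total variance
  c(estimate = qbar, se = sqrt(total), within = w, between = b)
}

# Made-up numbers, purely to show the call
est <- c(1.9, 2.1, 2.0, 2.2, 1.8)
se  <- rep(0.3, 5)
pooled <- pool_rubin(est, se)
pooled
```

The between-imputation term is what separates multiple imputation from single imputation: it carries the extra uncertainty that comes from not knowing the missing values.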

Joe L. Schafer’s packages (“norm”, “cat”, “mix”, “pan”) – some already listed above

  • norm: MI of multivariate continuous data under a normal model; ch. 5 of Sch97
  • cat : MI of multivariate categorical data under loglinear models; ch. 7–8 of Sch97
  • mix : MI of mixed continuous and categorical data under the general location model; ch. 9 of Sch97; (see above)
  • pan : MI of panel or clustered data under a multivariate linear mixed-effects model. The reference is a tech. report available from the package at
## [1] "/sfs/s/linux/rhel3_amd64/app/R/R_local/library_F22/pan"

All four of these are available on CRAN (no longer showing the full description)

sapply(c("norm", "cat", "mix", "pan"), packageDescription)
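Since the full descriptions are long, a more compact overview can restrict packageDescription() to a few fields. The following is a small sketch; describe_pkgs() is a hypothetical helper, and packages that are not installed simply yield NA entries.

```r
# Hypothetical helper: tabulate selected DESCRIPTION fields for several packages.
describe_pkgs <- function(pkgs, fields = c("Title", "Version")) {
  rows <- lapply(pkgs, function(p) {
    # packageDescription() warns and returns NA if the package is not installed
    d <- suppressWarnings(
      utils::packageDescription(p, fields = fields, drop = FALSE)
    )
    if (!inherits(d, "packageDescription")) {
      return(setNames(rep(NA_character_, length(fields)), fields))
    }
    vapply(fields, function(f) as.character(d[[f]]), character(1))
  })
  data.frame(package = pkgs, do.call(rbind, rows),
             stringsAsFactors = FALSE, row.names = NULL)
}

desc <- describe_pkgs(c("norm", "cat", "mix", "pan"))
desc
```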

CRAN task view ‘Official Statistics’

is considerably more comprehensive (than the Multivariate one):


A distinction between iterative model-based methods, k-nearest neighbor methods and miscellaneous methods is made. However, often the criteria for using a method depend on the scale of the data, which in official statistics are typically a mixture of continuous, semi-continuous, binary, categorical and count variables. In addition, measurement errors may corrupt non-robust imputation methods. Note that only few imputation methods can deal with mixed types of variables and only few methods account for robustness issues.

  • EM-based Imputation Methods:

    • Package mi provides iterative EM-based multiple Bayesian regression imputation of missing values and model checking of the regression models used. The regression models for each variable can also be user-defined. The data set may consist of continuous, semi-continuous, binary, categorical and/or count variables.
    • Package mice provides iterative EM-based multiple regression imputation. The data set may consist of continuous, binary, categorical and/or count variables.
    • Package mitools provides tools to perform analyses and combine results from multiply-imputed datasets.
    • Package Amelia provides multiple imputation where first bootstrap samples with the same dimensions as the original data are drawn, and then used for EM-based imputation. It is also possible to impute longitudinal data. The package in addition comes with a graphical user interface.
    • Package VIM provides EM-based multiple imputation (function irmi()) using robust estimations, which allows one to deal adequately with data including outliers. It can handle data consisting of continuous, semi-continuous, binary, categorical and/or count variables.
    • Package mix provides iterative EM-based multiple regression imputation. The data set may consist of continuous, binary or categorical variables, but methods for semi-continuous variables are missing.
    • Package pan provides multiple imputation for multivariate panel or clustered data.
    • Package norm provides EM-based multiple imputation for multivariate normal data.
    • Package cat provides EM-based multiple imputation for multivariate categorical data.
    • Package MImix provides tools to combine results for multiply-imputed data using mixture approximations.
    • Package robCompositions provides iterative model-based imputation for compositional data (function impCoda()).
  • Nearest Neighbor Imputation Methods

    • Package VIM provides an implementation of the popular sequential and random (within a domain) hot-deck algorithm. VIM also provides a fast k-nearest neighbor (knn) algorithm which can be used for large data sets. It uses a modification of the Gower Distance for numerical, categorical, ordered, continuous and semi-continuous variables.
    • Package yaImpute performs popular nearest neighbor routines for imputation of continuous variables where different metrics and methods can be used for determining the distance between observations.
    • Package robCompositions provides knn imputation for compositional data (function impKNNa()) using the Aitchison distance and adjustment of the nearest neighbor.
    • Package rrcovNA provides an algorithm for (robust) sequential imputation (functions impSeq() and impSeqRob()) by minimizing the determinant of the covariance of the augmented data matrix. Its application is limited to continuous scaled data.
    • Package impute (on Bioconductor) provides knn imputation of continuous variables.
  • Copula-based Imputation Methods:

    • The S4 class package CoImp imputes multivariate missing data by using conditional copula functions. The imputation procedure is semiparametric: the margins are non-parametrically estimated through local likelihood of low-degree polynomials while a range of different parametric models for the copula can be selected by the user. The missing values are imputed by drawing observations from the conditional density functions by means of the Hit or Miss Monte Carlo method. It works either for a matrix of continuous scaled variables or a matrix of discrete distributions.
  • Miscellaneous Imputation Methods:

    • Package missMDA imputes incomplete continuous variables by principal component analysis (PCA) or categorical variables by multiple correspondence analysis (MCA).
    • Package mice (function mice.impute.pmm()) and package Hmisc (function aregImpute()) allow predictive mean matching imputation.
    • Package VIM visualizes the structure of missing values using suitable plot methods. It also comes with a graphical user interface.
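Predictive mean matching, mentioned above for mice and Hmisc, can be illustrated with a simplified base R sketch. pmm_sketch() is a hypothetical helper for one incomplete response and one complete predictor. Unlike a proper implementation it does not draw the regression parameters from their posterior, so it understates imputation uncertainty.

```r
# Hypothetical helper: simplified predictive mean matching (single predictor).
pmm_sketch <- function(y, x, k = 5) {
  stopifnot(length(y) == length(x), !anyNA(x))
  dat <- data.frame(y = y, x = x)
  obs <- !is.na(dat$y)
  fit  <- lm(y ~ x, data = dat[obs, ])       # model fitted on observed cases
  pred <- predict(fit, newdata = dat)        # predicted means for all cases
  out <- y
  for (i in which(!obs)) {
    # The k observed cases whose predicted mean is closest to that of case i
    ord    <- order(abs(pred[obs] - pred[i]))
    donors <- which(obs)[ord][seq_len(min(k, sum(obs)))]
    # Impute the observed value of one randomly chosen donor
    out[i] <- y[donors[sample.int(length(donors), 1)]]
  }
  out
}

set.seed(42)
x <- rnorm(50)
y <- 2 * x + rnorm(50)
idx_miss <- c(3, 17, 40)
y[idx_miss] <- NA
y_imp <- pmm_sketch(y, x)
```

Because every imputed value is an actually observed value, the imputations stay within the range and scale of the data, which is one reason the method is popular for semi-continuous variables.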


Package: imputeR
Title: A General Imputation Framework in R
Description: General imputation framework based on variable selection methods including regularisation methods, tree-based models and dimension reduction methods.
Version:    1.0.0
Published:  2014-05-14
Author:     Lingbing Feng, Gen Nowak, Alan. H. Welsh, Terry. J. O'Neill


Package: softImpute
Title: Matrix Completion via Iterative Soft-Thresholded SVD
Version: 1.4
Date: 2015-2-13
Author: Trevor Hastie and Rahul Mazumder
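The iterative soft-thresholded SVD named in the title above can be sketched in base R. This is a bare-bones illustration with a cold start at zero and a fixed penalty lambda; soft_threshold_svd() and soft_impute_sketch() are hypothetical names, and the package itself is considerably more refined than this.

```r
# Hypothetical helpers; a sketch of the idea, not the package code.

# Shrink all singular values of z towards zero by lambda
soft_threshold_svd <- function(z, lambda) {
  s <- svd(z)
  d <- pmax(s$d - lambda, 0)
  s$u %*% (d * t(s$v))
}

soft_impute_sketch <- function(x, lambda, maxit = 200, tol = 1e-6) {
  miss <- is.na(x)
  z <- x
  z[miss] <- 0                                # cold start for missing cells
  for (it in seq_len(maxit)) {
    m <- soft_threshold_svd(z, lambda)        # low-rank approximation
    delta <- sum((m[miss] - z[miss])^2) / max(sum(z[miss]^2), 1e-12)
    z[miss] <- m[miss]                        # refill only the missing cells
    if (delta < tol) break
  }
  z                                           # observed cells are unchanged
}

set.seed(1)
x <- outer(1:6, 1:5) + matrix(rnorm(30, sd = 0.1), 6, 5)
x[cbind(c(1, 3, 6), c(2, 4, 5))] <- NA
x_hat <- soft_impute_sketch(x, lambda = 1)
```

The choice of lambda controls the effective rank of the fit: larger values shrink more singular values to zero. In practice it would be chosen over a grid, for example by holding out some observed entries.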

imputation : Archived in 2014 (policy violation: running on all cores)

  • by Jeff Wong on Github
  • also mentions the important paper by Cai, Candès and Shen (preprint on arXiv), “A Singular Value Thresholding Algorithm for Matrix Completion”

(We)blogs etc on R packages:

Norm Matloff, UC Davis - AKA “Mad (Data) Scientist”

Blog advocating “Available Cases” (AC), notably because MI (trying Amelia only) is “slow and not better statistically”
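To make the contrast concrete: complete-case analysis drops every row with any missing value, whereas an available-case analysis uses, for each pair of variables, all rows where both are observed. The base R sketch below uses simulated data (an assumption for illustration only) and the use argument of cov().

```r
set.seed(1)
n <- 200
x <- data.frame(a = rnorm(n), b = rnorm(n), c = rnorm(n))
# Knock out 40 values per column, independently
x$a[sample(n, 40)] <- NA
x$b[sample(n, 40)] <- NA
x$c[sample(n, 40)] <- NA

n_cc <- sum(complete.cases(x))          # rows usable in complete-case analysis
n_ac <- crossprod(1 - is.na(x))         # rows usable for each pair of variables

cov_cc <- cov(x, use = "complete.obs")           # complete cases
cov_ac <- cov(x, use = "pairwise.complete.obs")  # available cases

n_cc
n_ac
```

A known caveat of the pairwise version is that the resulting covariance matrix is not guaranteed to be positive semi-definite.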

Thomas Leeper’s course Multiple Imputation (simple ex. w/ mice, mi, and Amelia)

Rmd rendered web pages: “There are three main R packages … multiple imputation techniques.”

  • Amelia (by Gary King and collaborators),
  • mi (by Andrew Gelman and collaborators), and
  • mice (by Stef van Buuren and collaborators)
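As a minimal illustration of the usual impute / analyse / pool workflow, here is a sketch using mice and its bundled nhanes data. The wrapper run_mice_example() is a hypothetical name and the model formula is arbitrary. The code is skipped if mice is not installed.

```r
# Hypothetical wrapper around the standard mice workflow.
run_mice_example <- function(m = 5, seed = 123) {
  if (!requireNamespace("mice", quietly = TRUE)) {
    message("Package 'mice' is not installed; skipping the example.")
    return(invisible(NULL))
  }
  # 1. Impute: create m completed versions of the data
  imp <- mice::mice(mice::nhanes, m = m, printFlag = FALSE, seed = seed)
  # 2. Analyse: fit the same model to each completed data set
  fits <- with(imp, lm(bmi ~ age + chl))
  # 3. Pool: combine the m fits with Rubin's rules
  mice::pool(fits)
}

pooled_fit <- run_mice_example()
if (!is.null(pooled_fit)) print(summary(pooled_fit))
```

Amelia and mi follow the same three steps with their own function names, so the sketch carries over in spirit but not in syntax.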