data imputation

The aim of data imputation is to impute missing data in a data set.

Classifications of the imputation methods

  • categorical or not categorical data for the missing attributes or the observed ones
  • functional imputation or not

Algorithms

MICE : Multivariate Imputation by Chained Equations

It is an iterative algorithm looping over the variables to impute one variable at the time using a model taking as input the other variables. It loops over each variable having missing values several times (usually 10).

MICE operates under the assumption that given the variables used in the imputation procedure, the missing data are Missing At Random (MAR). MICE does not offer a guaranty of convergence.

Expectation-maximization and data augmentation

GAIN

It is based on a neuron network model.

VAE

It is based on auto encoder neuron network model.

PMM

It works from one missed ordinal attribute like KNN but taking the most frequent value in the neighborhood.

Decision tree with distribution as leaves

In scikit-learn, DecisionTreeClassifier the predictproba method can be used for implementing this.

A good properties of decision tree is that they are able to avoid some attributes based on some conditions (see Adult dataset, below).

Drawbacks :

  • it does take into account the missingness of variable to obtain the distribution, so MNAR cases can not be tackle efficiently
  • in DBLP:journals/csda/DooveBD14, during the split, variable selection is biased in favour of variables with certain characterics (e.g. variable with many categories), if these variables are no more informative than the competitors.
  • it is univariate
  • Decision trees tend to overfit on data a large number of variables (source)

Implementations

building a BID

Datasets

Adult

IMG_20240313_183028.jpg

https://archive.ics.uci.edu/dataset/2/adult

It is a dataset with mostly categorical attributes, with no duplicate in the tuples.

This post accepts webmentions. Do you have the URL to your post?

Otherwise, send your comment on my service.

Or interact from the fediverse with your username:

fediverse logo Share on the Fediverse