Friday, February 19th, 2010
Missing values in a data set can occur for several reasons:
One can try to ignore the missing entries; however, this approach may be misleading, especially for small data sets. Samples containing missing values still hold some value, and the information can be recovered by imputation. The following site contains useful resources, tutorials and links. In this study, I tried to compare different techniques for the imputation process, along with the corresponding MATLAB files.
Basically there are three methods:
In this approach, missing fields are filled with numbers drawn from a probability distribution, and the parameters of this distribution are estimated from the available entries of the same column.
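As a rough illustration of the idea above, here is a minimal Python sketch. It assumes the column is roughly normally distributed and that missing entries are marked with `None`; both are my assumptions for the example, not something fixed by the method. The function name `impute_from_normal` is hypothetical.

```python
import random
import statistics

def impute_from_normal(column, rng=None):
    """Fill None entries with draws from a normal fitted to the observed entries."""
    if rng is None:
        rng = random.Random(0)
    observed = [v for v in column if v is not None]
    mu = statistics.mean(observed)
    sigma = statistics.stdev(observed)  # sample std; needs at least 2 observed values
    # Observed entries are kept as they are; only the gaps get a random draw.
    return [v if v is not None else rng.gauss(mu, sigma) for v in column]

heights = [170.0, None, 165.0, 180.0, None, 175.0]
filled_heights = impute_from_normal(heights, random.Random(42))
```

Any other distribution could be swapped in; the only requirement is that its parameters can be estimated from the observed part of the column.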
One can also treat the missing-value problem as a regression problem. In this case, the column containing missing values is considered the output (dependent variable) and the other columns are the independent variables. Missing values can then be estimated with this regression model.
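The regression idea can be sketched in a few lines of Python. This is only a sketch under simplifying assumptions: a single fully observed predictor column `x`, a straight-line model fitted by least squares, and `None` as the missing-value marker. The name `regression_impute` is hypothetical.

```python
def regression_impute(x, y):
    """Fill None entries of y using a least-squares line fitted on complete rows."""
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    n = len(pairs)
    mean_x = sum(xi for xi, _ in pairs) / n
    mean_y = sum(yi for _, yi in pairs) / n
    sxx = sum((xi - mean_x) ** 2 for xi, _ in pairs)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Observed outputs are kept; missing ones are replaced by the fitted value.
    return [yi if yi is not None else intercept + slope * xi for xi, yi in zip(x, y)]

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, None, 8.0, 10.0]
filled_y = regression_impute(x, y)
```

With several predictor columns the same scheme applies, with the line replaced by a multiple regression fit.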
In this modern approach, not one but several filled tables are generated and then averaged. If we knew the missing values, estimating the model parameters would be straightforward. Similarly, if we knew the parameters of the data model, it would be possible to obtain unbiased predictions for the missing values. An iterative method can therefore be used: first predict the missing values based on assumed values for the parameters, use these predictions to update the parameter estimates, and repeat. The sequence of parameter estimates converges to the maximum-likelihood estimates.
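The predict-then-update loop can be illustrated with a small Python sketch. It is a simplified stand-in, not a full EM implementation: the "model" is assumed to be a straight line between one fully observed column `x` and a column `y` with gaps, and the variance correction that a proper EM step would carry is left out. The names `fit_line` and `iterative_impute` are hypothetical.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for fully filled x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def iterative_impute(x, y, tol=1e-10, max_iter=500):
    """Alternate between re-fitting the line and re-predicting the missing y values."""
    missing = [i for i, yi in enumerate(y) if yi is None]
    observed = [yi for yi in y if yi is not None]
    start = sum(observed) / len(observed)  # initial guess: mean of observed values
    filled = [yi if yi is not None else start for yi in y]
    for _ in range(max_iter):
        slope, intercept = fit_line(x, filled)   # update parameters from filled data
        change = 0.0
        for i in missing:                        # re-predict the missing values
            new_value = intercept + slope * x[i]
            change = max(change, abs(new_value - filled[i]))
            filled[i] = new_value
        if change < tol:
            break
    return filled, (slope, intercept)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, None]
em_filled, em_params = iterative_impute(x, y)
```

In this toy data the observed points lie exactly on a line, so the loop starts from the column mean as its guess for the missing entry and moves it step by step toward the value on that line.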
Tags: Classification, Matlab, Pattern Recognition
Posted in Optimization