How to deal with missing values in data analysis
Missing values in a data set can occur for several reasons:
- refusal to answer,
- skipped questions,
- procedural mistakes,
- or computer malfunctions.
One can try to ignore those missing entries; however, this approach may be misleading, especially for small data sets. Samples containing missing values still hold some value, and much of that information can be recovered by imputation. The following site contains useful resources, tutorials and links. In this study, I tried to compare different techniques for the imputation process, with the corresponding MATLAB files.
Basically there are three methods:
1. Random Fill
In this approach, missing fields are filled with numbers generated from a probability distribution, and the parameters of this distribution are estimated from the available entries of the same column.
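As a rough illustration of random fill, here is a minimal Python sketch. It assumes the column is roughly normally distributed, which the text above does not require; the function name random_fill, the use of None to mark a missing entry, and the sample data are all hypothetical choices made for this example.

```python
import random
import statistics


def random_fill(column, seed=None):
    """Replace None entries with draws from a normal distribution
    whose mean and standard deviation come from the observed entries."""
    rng = random.Random(seed)
    observed = [value for value in column if value is not None]
    mu = statistics.mean(observed)
    sigma = statistics.stdev(observed)  # needs at least two observed values
    return [value if value is not None else rng.gauss(mu, sigma)
            for value in column]


# Hypothetical column with two missing entries.
ages = [23.0, None, 31.0, 27.0, None, 35.0]
ages_filled = random_fill(ages, seed=0)
```

Observed entries are left untouched; only the missing ones are replaced. A different distribution (for example an empirical one, drawing directly from the observed values) may suit skewed or categorical columns better.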
2. Regression
One can also treat filling in missing values as a regression problem. In this case, the column containing missing values is treated as the output (dependent) variable and the other columns as the independent variables. The regression model is fitted on the rows where the output is observed and is then used to estimate the missing values.
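The regression idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: a single predictor column, a straight-line model fitted by ordinary least squares, and None as the marker for a missing entry. The function name regression_fill and the sample data are hypothetical.

```python
def regression_fill(x, y):
    """Fill None entries of y with predictions from a least-squares line
    fitted on the rows where y is observed. x must be complete."""
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    n = len(pairs)
    mean_x = sum(xi for xi, _ in pairs) / n
    mean_y = sum(yi for _, yi in pairs) / n
    sxx = sum((xi - mean_x) ** 2 for xi, _ in pairs)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in pairs)
    slope = sxy / sxx  # assumes the observed x values are not all equal
    intercept = mean_y - slope * mean_x
    return [yi if yi is not None else intercept + slope * xi
            for xi, yi in zip(x, y)]


# Hypothetical data: weight is missing for the third row.
height = [150.0, 160.0, 170.0, 180.0, 190.0]
weight = [50.0, 60.0, None, 80.0, 90.0]
weight_filled = regression_fill(height, weight)
```

With several predictor columns the same scheme applies, with the line replaced by a multiple regression model.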
3. Multiple Imputation
In this modern approach, not just one filled table is generated but several, and their results are then averaged. The underlying idea is as follows. If we knew the missing values, then estimating the model parameters would be straightforward. Similarly, if we knew the parameters of the data model, then it would be possible to obtain unbiased predictions for the missing values. An iterative method can therefore be used: first predict the missing values based on assumed values for the parameters, use these predictions to update the parameter estimates, and repeat. Under suitable conditions, the sequence of parameter estimates converges to maximum-likelihood estimates.
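The iterative scheme described above (predict the missing values, update the parameters, repeat) can be sketched as follows. This is a deliberately simplified illustration, not a full maximum-likelihood procedure: it assumes a straight-line model with one complete predictor column, tracks only the slope and intercept (not the noise variance), and starts from the mean of the observed values. The name iterative_fill, the tolerance, and the sample data are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for complete data."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def iterative_fill(x, y, tol=1e-10, max_iter=500):
    """Alternate between predicting missing y values and refitting the line."""
    missing = [i for i, yi in enumerate(y) if yi is None]
    observed = [yi for yi in y if yi is not None]
    start = sum(observed) / len(observed)  # initial guess: observed mean
    filled = [yi if yi is not None else start for yi in y]
    for _ in range(max_iter):
        # Update the parameter estimates from the currently filled data.
        slope, intercept = fit_line(x, filled)
        # Re-predict the missing values from the updated parameters.
        change = 0.0
        for i in missing:
            new_value = intercept + slope * x[i]
            change = max(change, abs(new_value - filled[i]))
            filled[i] = new_value
        if change < tol:
            break
    return filled


# Hypothetical data: the observed points lie exactly on y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, None, 8.0, None]
ys_filled = iterative_fill(xs, ys)
```

In this toy case the loop settles on the line through the observed points, so the filled values approach 6 and 10. To obtain several filled tables rather than one, a random residual could be added to each prediction and the whole procedure repeated.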