JSM 2012 Online Program

The views expressed here are those of the individual authors and not necessarily those of the JSM sponsors, their officers, or their staff.

Abstract Details

Activity Number: 186
Type: Contributed
Date/Time: Monday, July 30, 2012 : 10:30 AM to 12:20 PM
Sponsor: Social Statistics Section
Abstract - #306533
Title: Missing Value Imputation for Predictive Models on Large and Distributed Data Sources
Author(s): Jing Shyr*+ and Sier Han and Jane Chu
Companies: IBM and IBM and IBM
Address: 233 S. Wacker Drive, Chicago, IL, 60606, United States
Keywords: missing value imputation ; basic statistics ; MapReduce
Abstract:

The paper proposes a Map-Reduce method for imputing missing predictor values on large, distributed data sources before predictive models are built. First, for each predictor with missing values, imputation models based only on the target variable are built independently on each data source, on separate machines, using Map functions. During this step, validation samples are drawn at random from every data source; a Reduce function then merges them into one global validation sample and gathers the collection of imputation models. Second, another set of Map functions evaluates all imputation models against the global validation sample in a distributed manner, and the top K models are selected to form an ensemble model. Third, the ensemble model is sent to each data source to impute the missing predictor values. Finally, the completed dataset can be used to build any model for prediction, as well as for discovery and interpretation of relationships between the target and a set of predictors.
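
The steps described above can be illustrated with a minimal single-process sketch. It is not the authors' implementation: the abstract does not specify the model form, the scoring rule, or how the ensemble combines its members, so the sketch assumes a continuous predictor x and a continuous target y, a simple linear regression of x on y as the imputation model, mean squared error as the score, held-out validation rows, and a plain average of the top K predictions. All function names (map_fit, reduce_collect, map_score, select_top_k, impute) are hypothetical.

```python
import random
from statistics import mean


def map_fit(rows, val_fraction, rng):
    """Map step on one data source (assumed form): hold out a random
    validation sample, then fit x ~ a + b*y on the remaining complete rows.
    rows are (x, y) pairs where x may be None (missing)."""
    usable = [(x, y) for x, y in rows if x is not None]
    rng.shuffle(usable)
    n_val = max(1, int(val_fraction * len(usable)))
    validation, train = usable[:n_val], usable[n_val:]
    mx = mean(x for x, _ in train)
    my = mean(y for _, y in train)
    sxy = sum((x - mx) * (y - my) for x, y in train)
    syy = sum((y - my) ** 2 for _, y in train)
    slope = sxy / syy if syy > 0 else 0.0
    return (mx - slope * my, slope), validation


def reduce_collect(map_outputs):
    """Reduce step: gather all models and merge the validation samples."""
    models = [model for model, _ in map_outputs]
    validation = [row for _, sample in map_outputs for row in sample]
    return models, validation


def map_score(model, validation):
    """Second Map step: mean squared error of one model on the global sample."""
    a, b = model
    return mean((x - (a + b * y)) ** 2 for x, y in validation)


def select_top_k(models, validation, k):
    """Keep the k models with the lowest validation error."""
    return sorted(models, key=lambda m: map_score(m, validation))[:k]


def impute(rows, ensemble):
    """Fill each missing x with the average prediction of the ensemble."""
    def predict(y):
        return mean(a + b * y for a, b in ensemble)
    return [(predict(y) if x is None else x, y) for x, y in rows]
```

In a real deployment each map_fit and map_score call would run on the machine holding the data or model in question; here they are ordinary function calls so that the control flow is easy to follow.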

Different types of imputation models are built depending on whether the predictor and the target are categorical or continuous. Because only the target variable is used, only basic statistics between the predictor and the target, such as means, variances, covariances, and counts, need to be collected. These can be gathered in a single data pass, which is important for large and distributed data sources.
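
As a rough illustration of why a single pass suffices, the sketch below accumulates basic statistics that can be computed per partition and then merged. The specific choices are assumptions, not taken from the abstract: for a continuous predictor and continuous target, running sums yield a linear regression of the predictor on the target; for a categorical predictor and categorical target, a table of counts yields the most frequent predictor category per target category. The names PairStats, summarize, and mode_given_target are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class PairStats:
    """Running sums for a continuous predictor x and continuous target y."""
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    syy: float = 0.0
    sxy: float = 0.0

    def add(self, x, y):
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.syy += y * y
        self.sxy += x * y

    def merge(self, other):
        """Combine statistics from two partitions without revisiting the data."""
        return PairStats(self.n + other.n, self.sx + other.sx,
                         self.sy + other.sy, self.sxx + other.sxx,
                         self.syy + other.syy, self.sxy + other.sxy)

    def regression(self):
        """Intercept and slope of x regressed on y (assumed model form)."""
        mx, my = self.sx / self.n, self.sy / self.n
        var_y = self.syy / self.n - my * my
        cov_xy = self.sxy / self.n - mx * my
        slope = cov_xy / var_y if var_y > 0 else 0.0
        return mx - slope * my, slope


def summarize(rows):
    """One pass over a partition, skipping rows whose predictor is missing."""
    stats = PairStats()
    for x, y in rows:
        if x is not None:
            stats.add(x, y)
    return stats


def mode_given_target(counts, target_value):
    """Most frequent predictor category for a target category, from a
    Counter keyed by (target, predictor). Counters merge with `+`."""
    candidates = {p: c for (t, p), c in counts.items() if t == target_value}
    return max(candidates, key=candidates.get)
```

Raw sums of squares are the simplest mergeable form but can lose precision when values are large relative to their spread; a production version would more likely merge centered moments instead.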


The address information is for the authors that have a + after their name.
Authors who are presenting talks have a * after their name.
