Online Program

A Risk-based Methodology to De-identify Protected Health Information for the Heritage Health Prize
*Luk Arbuckle, CHEO Research Institute 
Khaled El Emam, CHEO Research Institute 
Ben Eze, Privacy Analytics 
Jonathan Gluck, Heritage Provider Network 
Jeremy Howard, Kaggle 
Gunes Koru, University of Maryland  
Lisa Gaudette, Privacy Analytics 
Emilio Neri, CHEO Research Institute 
Sean Rose, Privacy Analytics 

Keywords: re-identification, risk assessment, longitudinal, medical data, data disclosure, privacy

According to the US Health Insurance Portability and Accountability Act (HIPAA), the public disclosure of Protected Health Information (PHI) without patient consent is permitted if it is de-identified using accepted statistical methods to manage the risk of individual re-identification. The Heritage Provider Network (HPN), a provider of health care services in California, initiated the Heritage Health Prize (HHP) competition “to develop a predictive algorithm that can identify patients who will be admitted to the hospital within the next year, using historical claims data”. However, the complex longitudinal data from HPN for the HHP competition required the development of new methods to assess and evaluate the risk of re-identification. Five plausible re-identification attacks on this data were identified, and the probability of re-identification was evaluated for each. A de-identification algorithm was applied when the risk of re-identification was found to be above a pre-defined threshold. The final HHP competition dataset had a very small risk of re-identification, and was robust to violations of initial assumptions.
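The threshold-driven step described in the abstract (measure the risk, and de-identify only when it exceeds a pre-defined threshold) can be illustrated with a minimal sketch. This is not the methodology used for the HHP dataset: it assumes a simple maximum-risk metric (one over the size of the smallest group of records sharing the same quasi-identifier values) and a single hypothetical generalization step (widening age bands). The function names, the field names `age` and `sex`, and the band widths are all illustrative assumptions.

```python
from collections import Counter


def max_reid_risk(records, quasi_identifiers):
    """Assumed metric: 1 / size of the smallest equivalence class."""
    if not records:
        raise ValueError("records must not be empty")
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    class_sizes = Counter(keys)
    return 1.0 / min(class_sizes.values())


def generalize_age(record, width):
    """Hypothetical de-identification step: replace age with a band label."""
    out = dict(record)
    low = (out["age"] // width) * width
    out["age"] = f"{low}-{low + width - 1}"
    return out


def deidentify(records, quasi_identifiers, threshold, widths=(5, 10, 20, 40)):
    """Generalize only if the measured risk exceeds the threshold.

    Tries progressively wider age bands and stops at the first one that
    brings the risk to or below the threshold. Returns (records, risk);
    the caller should check the returned risk, since the widest band may
    still be insufficient.
    """
    risk = max_reid_risk(records, quasi_identifiers)
    if risk <= threshold:
        return records, risk
    candidate = records
    for width in widths:
        candidate = [generalize_age(r, width) for r in records]
        risk = max_reid_risk(candidate, quasi_identifiers)
        if risk <= threshold:
            break
    return candidate, risk
```

In this sketch the data is left untouched when the measured risk is already acceptable, which mirrors the conditional application described above. A longitudinal claims dataset would need a risk measure that accounts for multiple records per patient, which this example does not attempt.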