Blurring Data for Privacy and Usefulness
Hospitals with research agendas share a common problem: how to use medical records for research while protecting patient privacy. One approach—the data-protection equivalent of blurring the face of an anonymous source on television—has now been tested using real-world data. The results, which show promise for protecting privacy without rendering the data set useless, appear in the September/October 2008 issue of the Journal of the American Medical Informatics Association.
“It’s not a theoretical problem,” says Khaled El Emam, PhD, associate professor at the University of Ottawa and Canada Research Chair in electronic health information, who collaborated with Fida Kamal Dankar, PhD, on the paper. “We’re trying to protect privacy, but we need the tools.”
Just as the nightly news renders the faces of anonymous sources unrecognizable, the approach known as k-anonymity blurs distinctive variables to reduce the risk that someone could trace patients with distinctive characteristics. For example, the approach might cut birthdates down to birth years. Easily identifiable outliers (the octogenarian in a college town, the teenager in a retirement community) are omitted. In the data that remains, every record is indistinguishable from at least k − 1 others, so each group holds at least k records that look identical, and 1/k is deemed an acceptable level of risk.
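The blurring and omission steps described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure used in the paper: the field names (`birthdate`, `city`), the sample records, and the helper names `generalize` and `k_anonymize` are all hypothetical, and a real data set would involve more identifying variables.

```python
from collections import Counter

def generalize(record):
    # Blur the birthdate (assumed ISO "YYYY-MM-DD") down to the birth year.
    # The field names are hypothetical, chosen only for this sketch.
    return (record["birthdate"][:4], record["city"])

def k_anonymize(records, k):
    # Generalize every record, then omit any record whose blurred
    # values are shared by fewer than k records (the outliers).
    keys = [generalize(r) for r in records]
    counts = Counter(keys)
    return [key for key in keys if counts[key] >= k]

# Made-up sample data: three similar patients and one outlier.
records = [
    {"birthdate": "1985-03-14", "city": "Ottawa"},
    {"birthdate": "1985-11-02", "city": "Ottawa"},
    {"birthdate": "1985-07-30", "city": "Ottawa"},
    {"birthdate": "1926-01-09", "city": "Ottawa"},
]

released = k_anonymize(records, k=3)
print(released)
```

With k set to 3, the three patients born in 1985 form a group of identical blurred records and are kept, while the lone patient born in 1926 is omitted.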
That works in theory, but the actual risk depends on the type of data set and what an intruder wants from it. A prosecutor digging up dirt on a defendant would try to re-identify a specific person in the database. A journalist trying to discredit an organization’s data-security procedures would also only need to re-identify one person, but it wouldn’t matter who. El Emam set out to test whether k-anonymity works in both circumstances. His findings: k-anonymity correctly predicts the risk of re-identifying one specific individual with minimal harm to the value of the database (the prosecutor example). But using k-anonymity to protect against re-identifying an arbitrary person (the journalism example) is unnecessarily strict and compromises the research quality of the data.
Since researchers choose k based on statistical theory, El Emam suggests that data custodians run test cases before making the data available to researchers, to check whether the chosen k is sufficient or, as in the journalism example, overprotective. If needed, the number of groupings of k identical data points could then be adjusted so that the actual risk approximates the theoretical risk of 1/k, keeping the risk acceptably low while preserving data.
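One very simple form such a test case could take is sketched below. The article does not say how the test cases are run, so this is an assumption: it measures only the worst-case chance of picking out a specific person, taken here as 1 divided by the size of the smallest group of identical records, and compares it with the 1/k target. The function names and the sample data are hypothetical.

```python
from collections import Counter

def max_reidentification_risk(released):
    # Assumption for this sketch: the risk for a record is 1 divided by
    # the number of released records that look identical to it, so the
    # worst case comes from the smallest group.
    counts = Counter(released)
    return 1 / min(counts.values())

def meets_target(released, k):
    # Compare the measured worst-case risk with the theoretical 1/k.
    return max_reidentification_risk(released) <= 1 / k

# Made-up blurred records: (birth year, city).
released = [
    ("1985", "Ottawa"), ("1985", "Ottawa"), ("1985", "Ottawa"),
    ("1990", "Ottawa"), ("1990", "Ottawa"),
]

print(max_reidentification_risk(released))
print(meets_target(released, k=2), meets_target(released, k=3))
```

In this made-up data the smallest group has two records, so the check passes for k = 2 and fails for k = 3. A check for the journalism scenario would need additional information about the outside data an intruder could match against, which this sketch does not model.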
“What is needed are the steps to turn this article into a practical tool that custodians can use in conjunction with researchers,” says Joan Roch, chief privacy officer for Canada Health Infoway in Montreal, Quebec.
El Emam says he plans to continue exploring actual risks in various data-security scenarios: “It’s a big problem, and we’ve solved part of it.”