Not So Open: Redefining Goals for Sharing Health Data in Research

The following is a guest blog post by Andy Oram, writer and editor at O’Reilly Media.

One couldn’t come away with more enthusiasm for open data than at this month’s Health Datapalooza, the largest conference focused on using data in health care. The whole 2000-strong conference unfolds from the simple concept that releasing data publicly can lead to wonderful things, like discovering new cancer drugs or intervening with patients before they have to go to the emergency room.

But look more closely at the health care field, and open data is far from the norm. The demonstrated benefits of open data sets in other fields–they permit innovation from any corner and are easy to combine or “mash up” to uncover new relationships–may turn into risks in health care. There may be better ways to share data.

Let’s momentarily leave the heady atmosphere of the Datapalooza and take a subway a few stops downtown to the Health Privacy Summit, where fine points of patient consent, deidentification, and the data map of health information exchange were discussed the following day. Participants here agree that highly sensitive information is traveling far and wide for marketing purposes, and perhaps even for more nefarious uses, such as uncovering patients’ secrets and discriminating against them.

In addition to outright breaches–which seem to be reported at least once a week now, and can involve thousands of patients in one fell swoop–data is shared in many ways that arguably should be up to patients to decide. It flows from hospitals, doctors, and pharmacies to health information exchanges, researchers in both academia and business, marketers, and others.

Debate has raged for years between those who trust deidentification and those who claim that reidentification is too easy. This is not an arcane technicality–the whole industry of analytics represented at the Datapalooza rests on the result. Those who defend deidentification tend to be researchers in health care and the institutions that use their results. In contrast, many computer scientists outside the health care field cite instances where people have been reidentified, usually by combining data from various public sources.
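To make "combining data from various public sources" concrete, here is a minimal sketch of a linkage attack in Python. Everything in it is an assumption made for illustration: the records are invented, and the choice of ZIP code, birth year, and sex as the shared fields is hypothetical rather than taken from any study mentioned in this post.

```python
# Toy linkage sketch. All records are invented for illustration only.

# A "deidentified" data set: names removed, but some attributes remain.
deidentified_visits = [
    {"zip": "02138", "birth_year": 1950, "sex": "M", "diagnosis": "asthma"},
    {"zip": "02139", "birth_year": 1982, "sex": "F", "diagnosis": "diabetes"},
    {"zip": "02139", "birth_year": 1982, "sex": "F", "diagnosis": "migraine"},
]

# A hypothetical public list that carries names plus the same attributes.
public_roll = [
    {"name": "Pat Example", "zip": "02138", "birth_year": 1950, "sex": "M"},
    {"name": "Sam Sample", "zip": "02139", "birth_year": 1982, "sex": "F"},
    {"name": "Lee Placeholder", "zip": "02139", "birth_year": 1982, "sex": "F"},
]

# Fields assumed to appear in both sources (an assumption of this sketch).
QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")


def key(record):
    """Build the tuple of shared attributes used to join the two sources."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def link(visits, roll):
    """Return (name, diagnosis) pairs where exactly one person matches."""
    names_by_key = {}
    for person in roll:
        names_by_key.setdefault(key(person), []).append(person["name"])

    matches = []
    for visit in visits:
        candidates = names_by_key.get(key(visit), [])
        if len(candidates) == 1:  # a unique match means reidentification
            matches.append((candidates[0], visit["diagnosis"]))
    return matches


reidentified = link(deidentified_visits, public_roll)
print(reidentified)
```

In this toy data only one record is reidentified, because two people on the public list share the second combination of attributes and the match is ambiguous. That is roughly where the two camps part ways: coarsening or removing such fields makes unique matches rarer, while adding more outside sources can make them more common.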

Latanya Sweeney of Harvard and MIT, who won a privacy award this year at the summit, can be credited both with a historic reidentification of the records of Massachusetts Governor William Weld in 1997 and with a more recent exposé of state practices. The first research led to the current HIPAA regime for deidentification, while the second showed that states had not learned the lessons of anonymization. No successful reidentifications have been reported against data sets that use recommended deidentification techniques.

I am somewhat perplexed by the disagreement, but have concluded that it cannot be resolved on technical grounds. Those who look at the current state of reidentification are satisfied that health data can be secured. Those who look toward an unspecified future with improved algorithms find reasons to worry. In a summit lunchtime keynote, Adam Tanner reported his own efforts as a …