Record Collection

Doctors want to merge health information with genomic data to better understand disease. But at what cost to privacy?

Medical research is entering the age of Big Data. Health records — doctors’ notes, diagnoses, prescriptions, blood pressure readings, etc. — are now gradually being digitized. Scientists are developing programs that can intelligently search the records of thousands of patients for significant patterns, such as risk factors for various diseases.

Another source of medical Big Data is lurking in our own cells: the human genome. The cost of sequencing a person’s genome has dropped to a thousand dollars and will continue to plummet. More and more volunteers are agreeing to have information about their DNA stored in online databases, making it possible for scientists to scan their genes for clues to maladies ranging from heart disease to schizophrenia.

The dream of many medical researchers is to merge these two kinds of Big Data: to be able to investigate a single database housing the medical records and genetic information for thousands of people.

A number of projects are underway to combine these two forms of Big Data. In Britain, for example, half a million people have volunteered to be a part of UK Biobank. In the United States, the National Institutes of Health has organized several medical centers with electronic health records into the Electronic Medical Records and Genomics (eMERGE) Network. Researchers who have analyzed the eMERGE data from 13,000 people have already discovered a number of new links between gene variants and diseases ranging from skin cancer to anemia.

But medical Big Data is different in one important way from Big Data for linguistics or archaeology or most other sciences: the matter of privacy. It’s hard to imagine anything more private than your health records or your genome. The prospect of having huge databases with both kinds of information linked in one place makes ordinary identity theft seem minor by comparison.

The possibility that health data could be poached has raised worries that people could lose their insurance or face discrimination in hiring. Already, safeguards such as the Genetic Information Nondiscrimination Act (or GINA, which protects against discrimination in health coverage and employment) have been put in place.

To ensure the privacy of their databases, Bradley Malin and his colleagues at Vanderbilt have spent years creating new safeguards. Vanderbilt’s Medical Center has been digitizing medical records since the early 1980s. To turn these records into a research database, Malin and his colleagues gave each patient a number. In 2007, Vanderbilt began stocking a bank of blood samples taken from the blood left over from tests. With the consent of patients, the researchers assigned the same numbers to the blood samples — and to the DNA sequences they later obtained from them. Once the medical records and DNA go into the research database, they no longer have a formal link to the patient.
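The numbering scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Vanderbilt's actual system: the function name `build_research_db`, the field names, and the sample patients are all hypothetical, and the random number comes from Python's standard `secrets` module.

```python
import secrets

def build_research_db(patients):
    """Return {study_number: entry}, keeping no names and no lookup table."""
    research_db = {}
    for patient in patients:
        # Random number, unrelated to the patient's identity (hypothetical scheme).
        study_number = secrets.token_hex(8)
        # The same number labels both the health records and the DNA sequence.
        research_db[study_number] = {
            "health_records": patient["health_records"],
            "dna_sequence": patient["dna_sequence"],
        }
    return research_db

# Made-up example patients, for illustration only.
patients = [
    {"name": "Jane Doe", "health_records": ["dx: anemia"], "dna_sequence": "ACGT"},
    {"name": "John Roe", "health_records": ["dx: eczema"], "dna_sequence": "TTGA"},
]
research_db = build_research_db(patients)
```

Because the function never stores which number went to which name, the research copy carries no formal link back to the patient, which is the property the paragraph above describes.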

Making information anonymous is harder than it seems when that information is medical. Clues about people’s identities can slip through in all sorts of forms. Vanderbilt’s research database of electronic health records contains every piece of information that the university’s medical system accrues about each person: X-rays, billable diagnostic codes, discharge forms, and so on. A doctor’s hand-scribbled notes may include a stray reference to a patient’s name. Even the dates on medical records could allow someone to discern a patient’s identity.

To combat this, Malin’s group redacted patient names from the medical records. They also randomized the dates on every record. Scrambling the dates ruled out some types of research. If a flu expert wanted to see how the Vanderbilt patient population fared during a particular flu season, the database would be useless. But researchers can still, for example, see how people with a certain gene variant react to a particular arthritis drug.
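One common way to get this trade-off is to shift all of a patient's dates by a single random offset, so that calendar position is lost but the gaps between visits survive. The sketch below assumes that approach; the article does not say exactly how the dates were randomized, and the helper names `make_offset` and `deidentify`, the one-year limit, and the sample records are all hypothetical.

```python
import random
import re
from datetime import date, timedelta

def make_offset(rng, max_days=365):
    """Pick a nonzero shift of up to max_days in either direction."""
    days = 0
    while days == 0:
        days = rng.randint(-max_days, max_days)
    return timedelta(days=days)

def deidentify(record, name, offset):
    """Redact the patient's name from the note and shift the record's date."""
    note = re.sub(re.escape(name), "[REDACTED]", record["note"], flags=re.IGNORECASE)
    return {"date": record["date"] + offset, "note": note}

rng = random.Random(0)
offset = make_offset(rng)  # one offset per patient (an assumption)

# Made-up records for a single patient.
records = [
    {"date": date(2010, 1, 5), "note": "Jane Doe reports joint pain."},
    {"date": date(2010, 2, 4), "note": "Follow-up; started arthritis drug."},
]
cleaned = [deidentify(r, "Jane Doe", offset) for r in records]

# The interval between visits is unchanged, even though the dates are not.
gap_before = records[1]["date"] - records[0]["date"]
gap_after = cleaned[1]["date"] - cleaned[0]["date"]
```

Under this assumption, a drug-response study that depends on the time between visits still works, while a study tied to a specific calendar season does not. Simple pattern matching like this would also miss misspelled names or nicknames, which is part of why scrubbing free-text notes is hard.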

The scientists aren’t just protecting the privacy of their patients from an unscrupulous scientist or a hacker. There’s also the risk that government officials might demand to use the database to get information on an individual. So Malin and his team randomly drop a fraction of the patients from the database, leaving no way to know who’s in and who’s out. The government would have to analyze all the millions of records in order to find clues to one individual, without knowing whether that person is in the database at all.
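The random-drop idea can be illustrated with a short Python sketch. The fraction used here (10 percent), the function name `drop_random_fraction`, and the made-up patient numbers are assumptions for illustration; the article does not give the real fraction or method.

```python
import random

def drop_random_fraction(patient_ids, fraction, rng):
    """Keep each patient with probability 1 - fraction; record nothing about who was dropped."""
    return [pid for pid in patient_ids if rng.random() >= fraction]

rng = random.Random(42)
patient_ids = list(range(1000))  # made-up study numbers
kept = drop_random_fraction(patient_ids, fraction=0.10, rng=rng)
```

The point of the design is that no list of dropped patients exists anywhere, so nobody, including the database's own keepers, can confirm whether a given person's records are inside. A real system would presumably draw its randomness from an unpredictable source rather than a fixed seed, which is used here only to make the example repeatable.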

Such safeguards are important. Massive databases of medical and genetic information could help us understand human biology far better than we do today. Although everyone can benefit from those insights, scientists will need to ensure that not just anyone can probe our medical secrets along the way.