Dieter Meurer Prize Lecture
16 September 2010
When Dr. Herberger informed me earlier this year that I had been awarded the Dieter Meurer prize, I reacted with a fair amount of surprise and healthy suspicion. As a scholar toiling away in relative obscurity in a small public law school in the foothills of the Rocky Mountains in the United States, I could not fathom that my work had been noticed in Germany, much less deemed worthy of award.
But as I learned more about the Association for Computing in the Judiciary and about juris, my sense of suspicion faded. This is because the goals of these worthy and venerable institutions overlap so much and so well with my research agenda. Like all of you, I am trying to build bridges between law and computer science; together we are refusing to be satisfied with ossified and outmoded legal technologies and approaches; together we are pointing the way to a new, better approach to law and judging, one which embraces advances in technology.
The more I learned about the Association and, in particular, the work of Professor Herberger and the late Professor Meurer, the more my initial sense of doubt gave way to immense gratitude. I am also familiar with the amazing work of many of the past recipients of this award, and I am honored to be listed with them. So before I begin my substantive remarks, I wanted to thank Professor Herberger, the Association for Computing in the Judiciary, and juris for this honor. I am very grateful. I also wanted to specifically say thank you to Sabine Micka, who helped organize my travel.
When I am not writing articles that are also working computer programs, I specialize in information privacy law. Recently, I published an article in an American law journal entitled “Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization.”
The central contribution of this Article is to incorporate a new and exciting trend in computer science research into law and policy for the first time. In this sense, the Article fits very well with the goals of this conference.
The Article disrupts the central organizing principle of information privacy law, and because not every person in the room is an information privacy expert, I ask the experts to indulge me as I explain. Information privacy law relies heavily on the idea of anonymization, the term for techniques that protect the privacy of people described in databases through deletion and filtering. For example, we delete names, government ID numbers, birth dates, and home addresses from databases, motivated by the idea that we can analyze the data left behind without compromising the identities of the people in the data. Anonymization promises the best of both worlds, both privacy and utility, with one simple and inexpensive database operation.
Anonymization has been trusted not only by database owners but also by governments; legislators and regulators worldwide have created laws and rules which reward the use of anonymization. In fact, I claim that every single privacy law ever written rewards anonymization in some way, large or small, express or implied. In many cases, anonymization provides a perfect safe harbor: anonymize your data and the law no longer applies at all.
This brings us to the central tenet of information privacy law, the one currently under attack.
In the United States, we refer to this as the concept of “Personally Identifiable Information” or PII.
In the European Data Protection Directive, we encounter it through definitions of the term “personal data.”
The idea, no matter what its name, is that we can protect privacy by categorizing our data. We inspect our databases to separate information that identifies a person from information that does not. We approach this task almost like a biologist trying to classify different species of bird, or worm, or mushroom. Just as a mushroom scientist tries to divide the set of all mushrooms into poisonous and non-poisonous, so too does his information privacy counterpart try to divide the set of data into dangerous and benign.
This was a successful state of affairs for many decades. Thanks to the power of anonymization, our policymakers could rely on categorization, PII, and personal data to strike a balance, one which guaranteed privacy while leaving room for businesses and researchers to use data to better the world and grow the economy.
Unfortunately, the central premise upon which all of this rests—the power of anonymization—has been attacked in the past decade.
Computer scientists have repeatedly demonstrated the surprising power of reidentification. By analyzing the data that anonymization leaves behind, these researchers have shown that with only a bit of outside information and a bit of computational power, we can restore identity in anonymized data.
Let me give you only two examples:
First, in 1995, a graduate student named Latanya Sweeney analyzed a specific trio of information—a person’s birth date, U.S. postal code, and sex. She chose these three because many anonymized databases contained them, which was understandable since most people thought they could be left behind without compromising identity. The intuitions of many—including most experts—suggested we shared our birthdate, postal code, and sex in common with many other people; we thought we could hide in a cloud of the many people who shared this information.
Dr. Sweeney proved these intuitions wrong. By analyzing U.S. census data, she determined that 87% of Americans were uniquely identified by these three pieces of information. What once was recognized as anonymized was suddenly rejected as identifying. Today, many American laws reflect Dr. Sweeney’s work, by requiring the deletion of these three categories of information.
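For those curious how such a uniqueness figure is actually computed, the idea can be sketched in a few lines of Python. The records below are entirely invented toy data; the point is only the counting technique, not the 87% figure itself.

```python
from collections import Counter

# Toy records of (birth_date, postal_code, sex) -- hypothetical data,
# used only to illustrate how uniqueness of a quasi-identifier is measured.
records = [
    ("1970-03-12", "80309", "F"),
    ("1970-03-12", "80309", "M"),
    ("1984-11-02", "80302", "M"),
    ("1984-11-02", "80302", "M"),  # two people share this exact triple
    ("1991-07-21", "80301", "F"),
]

# Count how many records share each (birth_date, postal_code, sex) triple.
counts = Counter(records)

# A person is uniquely identified when no one else shares their triple.
unique = sum(1 for r in records if counts[r] == 1)
fraction_unique = unique / len(records)
print(f"{fraction_unique:.0%} of records are unique")  # 3 of 5 -> 60%
```

Run the same tally over real census data and you get Dr. Sweeney's result: the overwhelming majority of triples occur exactly once.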
But if regulators thought that Dr. Sweeney had proved that there was something unusual or special about those three pieces of information, they have learned quite recently how wrong they were. There are many other types of data that share this ability to identify. In fact, some have suggested that every piece of useful information about a person can be used to identify, if it is connected with the right piece of outside information.
As only one example, consider an American company, Netflix, which rents movies on DVD delivered through the postal mail. On the Netflix website, users rate the movies they have seen, to help Netflix suggest other movies they might enjoy. In an experiment in Internet collaboration—one I should add that has been celebrated for its many non-privacy-related contributions—Netflix released one hundred million records revealing how a half million customers had rated movies, but only after first anonymizing the data to protect identity.
A mere two weeks after the data release, two researchers named Arvind Narayanan and Vitaly Shmatikov announced a surprising result: The movies we watch act like fingerprints. If you know only a little about a Netflix customer’s movie watching habits, you have a good chance of discovering his or her identity.
For example, the researchers discovered that if you know six somewhat obscure movies a person watched, and nothing else, you can identify 84% of the people in the database. If you know six movies they watched and the approximate dates on which they rated them, you can identify 99% of the people.
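The matching step behind this kind of attack is simple enough to sketch. The snippet below uses entirely invented users, movie titles, and dates; it shows only the general idea of filtering an "anonymized" release against a few pieces of outside information, not the researchers' actual algorithm.

```python
from datetime import date

# Hypothetical "anonymized" release: user id -> {movie: rating date}.
# All names and dates here are invented purely for illustration.
release = {
    "user_001": {"Brazil": date(2005, 3, 1), "Pi": date(2005, 3, 5),
                 "Stalker": date(2005, 4, 2)},
    "user_002": {"Brazil": date(2005, 6, 9), "Alien": date(2005, 6, 11)},
    "user_003": {"Pi": date(2005, 3, 4), "Stalker": date(2005, 4, 1),
                 "Brazil": date(2005, 2, 20)},
}

def candidates(known, tolerance_days=3):
    """Return ids whose record matches every known (movie, date) pair
    within the date tolerance -- the adversary's outside information."""
    matches = []
    for user, ratings in release.items():
        if all(movie in ratings and
               abs((ratings[movie] - when).days) <= tolerance_days
               for movie, when in known.items()):
            matches.append(user)
    return matches

# The adversary knows the target rated these movies around these dates.
known = {"Brazil": date(2005, 3, 2), "Pi": date(2005, 3, 6)}
print(candidates(known))  # only one record survives the filter
```

With even a few constraints, the candidate set collapses to a single record, which is exactly why a handful of ratings behaves like a fingerprint.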
The lesson? When Netflix customers are asked at a dinner party to list their six favorite obscure movies, they cannot answer unless they want every person at the table to be able to look up every movie they have ever rated with Netflix.
But more seriously, what is the broader lesson?
If we continue to embrace the old PII/personal data approach to protecting privacy, we will end up with worthless laws, because these categories will continue to expand with each new advance in reidentification. For example, the American health privacy law, HIPAA, lists eighteen categories of information a health provider can delete to fall outside the law. Should American regulators expand this list to contain movie ratings? Of course not; this would miss the point entirely.
Unfortunately, in 15 minutes, the best I can do is share the depressing and bleak part of the story. Time doesn’t allow me to share my solutions in detail except to say one thing: nothing we do to replace our old laws will share the power and ease of solutions based on anonymization. Preserving privacy will become even more difficult than it is today. Beyond this note, I refer you to the paper to see the entire story.
This concludes my substantive remarks. Please let me reiterate: I am most grateful to have received this award. I look forward to meeting many of you and learning from many of you throughout the day. Have a wonderful conference, and once again, thank you.