Prize Winner 2010: Paul Ohm

Dieter Meurer Prize Lecture

Paul Ohm

16 September 2010

When Dr. Herberger informed me earlier this year that I had been awarded the Dieter Meurer prize, I reacted with a fair amount of surprise and healthy suspicion. As a scholar toiling away in relative obscurity in a small public law school in the foothills of the Rocky Mountains in the United States, I could not fathom that my work had been noticed in Germany, much less deemed worthy of an award.

But as I learned more about the Association for Computing in the Judiciary and about juris, my sense of suspicion faded. This is because the goals of these worthy and venerable institutions overlap so much and so well with my research agenda. Like all of you, I am trying to build bridges between law and computer science; together we are refusing to be satisfied with ossified and outmoded legal technologies and approaches; together we are pointing the way to a new, better approach to law and judging, one which embraces advances in technology.

The more I learned about the Association and, in particular, the work of Professor Herberger and the late Professor Meurer, the more my initial sense of doubt gave way to immense gratitude. I am also familiar with the amazing work of many of the past recipients of this award, and I am honored to be listed with them. So before I begin my substantive remarks, I wanted to thank Professor Herberger, the Association for Computing in the Judiciary, and juris for this honor. I am very grateful. I also wanted to say a specific thank you to Sabine Micka, who helped organize my travel.

When I am not writing articles that are also working computer programs, I specialize in information privacy law. Recently, I published an article in an American law journal entitled “Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization.”

The central contribution of this Article is to incorporate a new and exciting trend in computer science research into law and policy for the first time. In this sense, the Article fits very well with the goals of this conference.

The Article disrupts the central organizing principle of information privacy law, and because not every person in the room is an information privacy expert, I ask the experts to indulge me as I explain. Information privacy law relies heavily on the idea of anonymization, the term used to describe techniques that protect the privacy of people described in databases by deletion and filtration. For example, we delete names, government ID numbers, birth dates, and home addresses from databases, motivated by the idea that we can analyze the data left behind without compromising the identities of the people in the data. Anonymization provides the best of both worlds, promising both privacy and utility with one simple and inexpensive database operation.
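To make that single operation concrete, here is a minimal sketch in Python of anonymization by deletion as just described. The field names and the sample record are hypothetical, invented only for illustration.

```python
# A minimal sketch of "anonymization by deletion": strip the fields treated
# as direct identifiers from each record and release whatever remains.
# Field names and the sample record below are hypothetical.

DIRECT_IDENTIFIERS = {"name", "government_id", "birth_date", "home_address"}

def anonymize(record: dict) -> dict:
    """Return a copy of the record with direct identifiers removed."""
    return {field: value for field, value in record.items()
            if field not in DIRECT_IDENTIFIERS}

patients = [
    {"name": "A. Example", "government_id": "123-45-6789",
     "birth_date": "1970-01-01", "home_address": "1 Main St",
     "postal_code": "80309", "sex": "F", "diagnosis": "asthma"},
]

released = [anonymize(p) for p in patients]
print(released)
# -> [{'postal_code': '80309', 'sex': 'F', 'diagnosis': 'asthma'}]
```

Note what survives the deletion: a postal code, a sex, a diagnosis. That residue is exactly what the reidentification research described below exploits.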

Anonymization has been trusted not only by database owners but also by governments; legislators and regulators worldwide have created laws and rules which reward the use of anonymization. In fact, I claim that every single privacy law ever written rewards anonymization in some way, large or small, express or implied. In many cases, anonymization provides a perfect safe harbor: anonymize your data and the law no longer applies at all.

This brings us to the central tenet of information privacy law, the one currently under attack.

In the United States, we refer to this as the concept of “Personally Identifiable Information” or PII.

In the European Data Protection Directive, we encounter it through definitions of the term “personal data.”

The idea, no matter what its name, is that we can protect privacy by categorizing our data. We inspect our databases to separate information that identifies a person from information that does not. We approach this task almost like a biologist trying to classify different species of bird, or worm, or mushroom. Just as a mushroom scientist tries to divide the set of all mushrooms into poisonous and non-poisonous, so too does his information privacy counterpart try to divide the set of data into dangerous and benign.

This was a successful state of affairs for many decades. Thanks to the power of anonymization, our policymakers could rely on categorization, PII, and personal data to strike a balance, one which guaranteed privacy while leaving room for businesses and researchers to use data to better the world and grow the economy.

[Pause]

Unfortunately, the central premise upon which all of this rests, the power of anonymization, has been attacked in the past decade.

Computer scientists have repeatedly demonstrated the surprising power of reidentification. By analyzing the data that anonymization leaves behind, these researchers have shown that with only a bit of outside information and a bit of computational power, we can restore identity in anonymized data.

Let me give you only two examples:

First, in 1995, a graduate student named Latanya Sweeney analyzed a specific trio of information: a person’s birth date, U.S. postal code, and sex. She chose these three because many anonymized databases contained them, which was understandable since most people thought they could be left behind without compromising identity. The intuitions of many, including most experts, suggested that we shared our birth date, postal code, and sex with many other people; we thought we could hide in a cloud of the many people who shared this information.

Dr. Sweeney proved these intuitions wrong. By analyzing U.S. census data, she determined that 87% of Americans were uniquely identified by these three pieces of information. What once was recognized as anonymized was suddenly rejected as identifying. Today, many American laws reflect Dr. Sweeney’s work by requiring the deletion of these three categories of information.
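To show the shape of that measurement, here is a small sketch in Python, using invented records, of how one counts the people who are unique on the trio of birth date, postal code, and sex; Dr. Sweeney’s 87% figure came from running this kind of count over U.S. census data.

```python
# A sketch of the uniqueness measurement behind Dr. Sweeney's result:
# what fraction of records are the only ones in the dataset with their
# particular (birth date, postal code, sex) combination? The records
# below are hypothetical.
from collections import Counter

QUASI_IDENTIFIERS = ("birth_date", "postal_code", "sex")

def fraction_unique(records, quasi_identifiers=QUASI_IDENTIFIERS):
    """Fraction of records whose quasi-identifier combination appears exactly once."""
    key = lambda r: tuple(r[q] for q in quasi_identifiers)
    counts = Counter(key(r) for r in records)
    return sum(1 for r in records if counts[key(r)] == 1) / len(records)

sample = [
    {"birth_date": "1970-01-01", "postal_code": "80309", "sex": "F"},
    {"birth_date": "1970-01-01", "postal_code": "80309", "sex": "F"},
    {"birth_date": "1985-06-15", "postal_code": "10001", "sex": "M"},
]
print(fraction_unique(sample))  # 0.333...: only the last record is unique
```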

But if regulators thought that Dr. Sweeney had proved that there was something unusual or special about those three pieces of information, they have learned quite recently how wrong they were. There are many other types of data that share this ability to identify. In fact, some have suggested that every piece of useful information about a person can be used to identify, if it is connected with the right piece of outside information.

As only one example, consider an American company, Netflix, which rents movies on DVD delivered through the postal mail. On the Netflix website, users rate the movies they have seen, to help Netflix suggest other movies they might enjoy. In an experiment in Internet collaboration (one, I should add, that has been celebrated for its many non-privacy-related contributions), Netflix released one hundred million records revealing how a half million customers had rated movies, but only after first anonymizing the data to protect identity.

A mere two weeks after the data release, two researchers named Arvind Narayanan and Vitaly Shmatikov announced a surprising result: The movies we watch act like fingerprints. If you know only a little about a Netflix customer’s movie-watching habits, you have a good chance of discovering his or her identity.

For example, the researchers discovered that if you know six somewhat obscure movies a person watched, and nothing else, you can identify 84% of the people in the database. If you know six movies they watched and the approximate dates on which they rated them, you can identify 99% of the people.
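The toy sketch below, again in Python with invented data, illustrates the linkage idea: a few known (movie, approximate rating date) observations about a target are scored against each released record, and the best-scoring record is the candidate identity. This is not the researchers’ actual algorithm, which uses a more careful statistical scoring, but it shows why a handful of ratings can act like a fingerprint.

```python
# A toy illustration of a linkage attack on "anonymized" movie ratings:
# score each released record by how many of the attacker's known
# observations it matches, within a tolerance on the rating date.
# All data below are hypothetical.
from datetime import date

def match_score(known, record, date_slack_days=14):
    """Count known (movie, date) observations consistent with a released record."""
    score = 0
    for movie, seen_on in known:
        rated_on = record.get(movie)
        if rated_on is not None and abs((rated_on - seen_on).days) <= date_slack_days:
            score += 1
    return score

# Released records: pseudonymous user id -> {movie title: rating date}
released = {
    "user_17": {"Obscure Film A": date(2005, 3, 2), "Obscure Film B": date(2005, 7, 9)},
    "user_42": {"Obscure Film A": date(2004, 1, 1)},
}

# What the attacker knows from outside the dataset, e.g. a public review site
known = [("Obscure Film A", date(2005, 3, 5)), ("Obscure Film B", date(2005, 7, 1))]

best_match = max(released, key=lambda uid: match_score(known, released[uid]))
print(best_match)  # user_17
```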

The lesson? When Netflix customers are asked at a dinner party to list their six favorite obscure movies, they cannot answer unless they want every person at the table to be able to look up every movie they have ever rated with Netflix.

But more seriously, what is the broader lesson?

If we continue to embrace the old PII/personal data approach to protecting privacy, we will end up with worthless laws, because these categories will continue to expand with each new advance in reidentification. For example, the American health privacy law, HIPAA, lists eighteen categories of information a health provider can delete to fall outside the law. Should American regulators expand this list to include movie ratings? Of course not; this would miss the point entirely.

Unfortunately, in 15 minutes, the best I can do is share the depressing and bleak part of the story. Time doesn’t allow me to share my solutions in detail except to say one thing: nothing we do to replace our old laws will share the power and ease of solutions based on anonymization. Preserving privacy will become even more difficult than it is today. Beyond this note, I refer you to the paper to see the entire story.

This concludes my substantive remarks. Please let me reiterate: I am most grateful to have received this award. I look forward to meeting many of you and learning from many of you throughout the day. Have a wonderful conference, and once again, thank you.